Resolution:
Resolved
You may encounter the following error when running mpp-dyna jobs on multiple nodes:
[c0054:22206:0:22206] ib_mlx5_log.c:179 Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0054:22206:0:22206] ib_mlx5_log.c:179 RC QP 0xef8 wqe[365]: RDMA_READ s-- [rva 0x32a5cb38 rkey 0x20000] [va 0x319d3bf0 len 10200 lkey 0x2e5f98] [rqpn 0xfb8 dlid=2285 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal
Cause of the Error
This issue occurs because the UCX library bundled with LS-DYNA supports only Mellanox InfiniBand EDR, while Cardinal uses Mellanox InfiniBand NDR. As a result, LS-DYNA cannot communicate correctly over the newer fabric.
Affected versions
mpp-dyna versions 11, 13, and 15, when running on multiple nodes on Cardinal
Workaround
The workaround is to bypass the UCX library for MPI communication by setting the environment variable that matches your MPI implementation:
For Intel MPI:
export FI_PROVIDER="verbs"
For OpenMPI:
export OMPI_MCA_btl_openib_allow_ib=1
Set the variable that matches your MPI implementation before executing the mppdyna command.
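As a minimal sketch, a job script could select the right variable like this. The helper set_mpi_workaround and the MPI_FLAVOR variable are hypothetical names used only for illustration; they are not part of LS-DYNA, Intel MPI, or OpenMPI. Only the two export lines come from the workaround above.

```shell
#!/bin/bash
# Hypothetical helper: applies the workaround variable for the given MPI flavor.
# Accepts "intel" or "openmpi"; returns non-zero for any other value.
set_mpi_workaround() {
  case "$1" in
    intel)
      # Intel MPI workaround from this article
      export FI_PROVIDER="verbs"
      ;;
    openmpi)
      # OpenMPI workaround from this article
      export OMPI_MCA_btl_openib_allow_ib=1
      ;;
    *)
      echo "Unknown MPI flavor: $1 (expected intel or openmpi)" >&2
      return 1
      ;;
  esac
}

# MPI_FLAVOR is an assumed, user-defined variable; it defaults to "intel" here.
set_mpi_workaround "${MPI_FLAVOR:-intel}"

# Run your usual mppdyna command after this point (not shown).
```

Placing the call before the mppdyna line ensures the variable is already exported in the environment that the MPI launcher inherits.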