LS-DYNA mpp-dyna Cardinal: Remote access error on mlx5_0:1, RDMA_READ

Resolution: Resolved

You may encounter the following error when running mpp-dyna jobs on multiple nodes:

[c0054:22206:0:22206] ib_mlx5_log.c:179  Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0054:22206:0:22206] ib_mlx5_log.c:179  RC QP 0xef8 wqe[365]: RDMA_READ s-- [rva 0x32a5cb38 rkey 0x20000] [va 0x319d3bf0 len 10200 lkey 0x2e5f98] [rqpn 0xfb8 dlid=2285 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal

Cause of the Error

This issue occurs because the UCX library bundled with LS-DYNA supports only Mellanox InfiniBand EDR, whereas Cardinal uses Mellanox InfiniBand NDR. As a result, LS-DYNA cannot communicate correctly over the newer fabric.

Affected versions

mpp-dyna versions 11, 13, and 15, when running on multiple nodes on Cardinal

Workaround

The workaround is to bypass the UCX library for MPI communication by setting the environment variable that matches your MPI implementation:

For Intel MPI:

export FI_PROVIDER="verbs"

For Open MPI:

export OMPI_MCA_btl_openib_allow_ib=1

Set the appropriate variable before running the mppdyna command.
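
As a rough illustration of where the setting might sit in a job script, the sketch below wraps the two exports from this page in a small helper and calls it before the launch line. The function name set_dyna_mpi_workaround, the MPI_FLAVOR variable, and the commented-out launch line with its input file name are hypothetical placeholders, not part of any documented interface; adapt them to your own job.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): export the workaround
# variable from this page for the MPI implementation given as $1.
set_dyna_mpi_workaround() {
    case "$1" in
        intel)
            # Intel MPI: use the verbs provider
            export FI_PROVIDER="verbs"
            ;;
        openmpi)
            # Open MPI: setting taken from this page
            export OMPI_MCA_btl_openib_allow_ib=1
            ;;
        *)
            echo "unknown MPI flavor: $1" >&2
            return 1
            ;;
    esac
}

# MPI_FLAVOR is an assumed variable name; falls back to Intel MPI if unset.
set_dyna_mpi_workaround "${MPI_FLAVOR:-intel}"

# Placeholder launch line; replace with the actual mppdyna invocation
# and input file for your job.
# srun mppdyna i=input.k
```

Keeping the export in the same script as the launch line ensures the variable is set in the environment that starts mppdyna, rather than only in an interactive login shell.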