Cardinal

Some MKL environment variables have incorrect paths

MKL module files define some helper environment variables with incorrect paths, which can cause link-time errors. All three clusters are affected, and we are working to correct the module files. As a workaround, users can redefine the affected environment variable with the correct path. Because this requires some experience with build environments, we recommend contacting oschelp@osc.edu for assistance. One such error was seen on Cardinal with module intel-oneapi-mkl/2023.2.0, which defines the environment variable MKL_LIBS_INT64.
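The workaround might be approached roughly as in the following sketch. It assumes a bash-like shell and assumes the helper variable holds linker flags of the form `-L<dir> -l<lib>`; the actual contents of MKL_LIBS_INT64 are not shown in this notice, so inspect the variable with `echo` first. The function name `check_link_dirs` and the `/wrong/prefix` and `/correct/prefix` paths are hypothetical placeholders, not real locations on any cluster.

```shell
# Report every -L<dir> entry in a link-flag string whose directory does not exist.
# Assumption: the variable holds linker flags such as "-L/some/dir -lfoo".
# Returns 0 if all -L directories exist, 1 otherwise.
check_link_dirs() {
    missing=0
    # Unquoted $1 is split on whitespace into individual flags (bash/sh behavior).
    for flag in $1; do
        case "$flag" in
            -L*)
                dir=${flag#-L}   # strip the leading -L to get the directory
                if [ ! -d "$dir" ]; then
                    echo "missing directory: $dir"
                    missing=1
                fi
                ;;
        esac
    done
    return $missing
}

# Possible usage after loading the module:
#   module load intel-oneapi-mkl/2023.2.0
#   echo "$MKL_LIBS_INT64"
#   check_link_dirs "$MKL_LIBS_INT64" || echo "MKL_LIBS_INT64 points at a missing directory"
#
# Redefining the variable for the current session, with placeholder prefixes
# standing in for the incorrect and correct paths:
#   export MKL_LIBS_INT64=$(printf '%s' "$MKL_LIBS_INT64" | sed 's|/wrong/prefix|/correct/prefix|g')
```

A redefinition made this way lasts only for the current shell session and is overwritten if the module is reloaded, so in a batch job it would need to come after the `module load` line.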

LS-DYNA mpp-dyna Cardinal: Remote access error on mlx5_0:1, RDMA_READ

You may encounter the following error when running mpp-dyna jobs on multiple nodes:

[c0054:22206:0:22206] ib_mlx5_log.c:179  Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0054:22206:0:22206] ib_mlx5_log.c:179  RC QP 0xef8 wqe[365]: RDMA_READ s-- [rva 0x32a5cb38 rkey 0x20000] [va 0x319d3bf0 len 10200 lkey 0x2e5f98] [rqpn 0xfb8 dlid=2285 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal

Cause of the Error

Unknown

Affected versions

mpp-dyna versions 11 and 13, when running on multiple nodes
