OpenMPI 4 and NVHPC MPI Compatibility Issues with SLURM HWLOC
A pure MPI application using mpirun or mpiexec with more ranks than the number of NUMA nodes may encounter an error similar to the following:
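While the underlying incompatibility involves OpenMPI 4 / NVHPC MPI and the hwloc library used by SLURM, relaxing the launcher's process binding sometimes avoids binding-related launch failures of this kind. The sketch below is an assumption, not a site-confirmed resolution; the rank count and binary name are placeholders.

    # Let mpirun place ranks without binding them (placeholder rank count and binary).
    mpirun --bind-to none -np 8 ./my_mpi_app
    # Alternatively, request core-level binding instead of NUMA-level binding.
    mpirun --bind-to core -np 8 ./my_mpi_app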
Cardinal hosted a version of bwa, 0.7.17, that had an unpatched vulnerability. This version has been removed from Cardinal in favor of 0.7.18.
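If a workflow still references the removed version, switching to the patched one is a one-line change. The module name and command below are assumptions for illustration; check module avail bwa for what is actually installed.

    # Placeholder module name; confirm with: module avail bwa
    module load bwa/0.7.18
    # Illustrative bwa invocation; substitute your own reference and reads.
    bwa mem ref.fa reads.fq > aln.sam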
You may encounter the following error while running mpp-dyna jobs with multiple nodes:
[c0054:22206:0:22206] ib_mlx5_log.c:179 Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0054:22206:0:22206] ib_mlx5_log.c:179 RC QP 0xef8 wqe[365]: RDMA_READ s-- [rva 0x32a5cb38 rkey 0x20000] [va 0x319d3bf0 len 10200 lkey 0x2e5f98] [rqpn 0xfb8 dlid=2285 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal
Resolution: unknown.
Affected versions: mpp-dyna 11 and 13, when running on multiple nodes.
You may encounter the following error while running Ansys on Cardinal:
OMP: Error #100: Fatal system error detected.
OMP: System error #22: Invalid argument
forrtl: error (76): Abort trap signal
Set the environment variable KMP_AFFINITY=disabled before running Ansys.
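In a batch script this looks like the following sketch; the Ansys launch line is illustrative and should be replaced with your actual invocation and input files.

    # Disable the Intel OpenMP runtime's thread-affinity handling before launching Ansys.
    export KMP_AFFINITY=disabled
    # Illustrative launch line; substitute your actual Ansys command.
    ansys -b -i input.dat -o output.out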
You may encounter the following error while running an Abaqus parallel job with PMPI:
When running a full-node MPI job with MVAPICH 3.0, you may encounter the following warning message:
When running MPI+OpenMP hybrid code with the Intel Classic Compiler and MVAPICH 3.0, you may encounter the following warning message from hwloc:
Users may encounter the following errors when compiling a C++ program with GCC 13:
error: 'uint64_t' in namespace 'std' does not name a type
or
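These errors typically appear because GCC 13's standard library headers no longer include <cstdint> transitively, so fixed-width types such as std::uint64_t must be pulled in explicitly. The preferred fix is to add #include <cstdint> to the affected source file; the compile-time workaround sketched below (file and program names are placeholders) forces the include without editing the code.

    # Force <cstdint> to be processed first, without modifying the source file.
    g++ -include cstdint -o myprog myprog.cpp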
Several applications using OpenMPI, including HDF5, Boost, Rmpi, ORCA, and CP2K, may fail with errors such as
mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
or
Caught signal 11: segmentation fault
We have identified that the issue is related to HCOLL (Hierarchical Collectives) being enabled in OpenMPI.
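A quick way to confirm this, offered as a sketch rather than a site-prescribed resolution, is to disable the hcoll collective component for the affected run:

    # Disable the HCOLL collective component via an OpenMPI MCA parameter
    # (placeholder rank count and binary).
    mpirun --mca coll_hcoll_enable 0 -np 4 ./my_app
    # Or set it in the environment so every launch in the job inherits it.
    export OMPI_MCA_coll_hcoll_enable=0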
STAR-CCM+ encounters errors when running MPI jobs with Intel MPI or OpenMPI, displaying the following message:
ib_iface.c:1139 UCX ERROR Invalid active_speed on mlx5_0:1: 128
This issue occurs because the UCX library (v1.8) bundled with STAR-CCM+ only supports Mellanox InfiniBand EDR, while Mellanox InfiniBand NDR is used on Cardinal. As a result, STAR-CCM+ fails to correctly communicate over the newer fabric.
Affected versions: 18.18.06.006, 19.04.009, and possibly later versions.
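A generic workaround for an outdated bundled UCX, stated here as an assumption rather than a confirmed fix for STAR-CCM+ on Cardinal, is to put a newer UCX installation ahead of the bundled copy on the library search path; the path and launch line below are placeholders.

    # Placeholder path; substitute a newer UCX installation available on the cluster
    # (a ucx module, if provided, may set this for you).
    export LD_LIBRARY_PATH="/path/to/newer/ucx/lib:$LD_LIBRARY_PATH"
    # Illustrative STAR-CCM+ batch launch.
    starccm+ -batch run.java simulation.sim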