Software

MVAPICH 3.0 hang due to PMI mismatch with Slurm

Applications such as Quantum ESPRESSO, LAMMPS, and NWChem have experienced hangs with MVAPICH 3.0 due to a PMI mismatch: MVAPICH 3.0 was built against PMI-1, while the newer Slurm versions on RHEL 9 use PMI-2. Although the MVAPICH development team states that the PMI-1 interface should interoperate with Slurm's PMI-2 implementation, there may be a bug in MVAPICH 3.0.

We are currently testing MVAPICH 4.1 and plan to migrate the software stack associated with MVAPICH 3.0 to MVAPICH 4.1 in the coming weeks.
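As a first diagnostic, you can check which PMI plugins your Slurm build exposes and select one explicitly when launching. This is a minimal sketch, not a confirmed fix for the hang; the plugin names reported by srun --mpi=list and the exact module name are site specific.

    # List the PMI plugins available in this Slurm build
    srun --mpi=list

    # Launch with an explicitly chosen PMI plugin instead of the default
    module load mvapich/3.0
    srun --mpi=pmi2 -N 2 -n 8 ./my_app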

Performance issues with MVAPICH2 on Cardinal

We have observed that several applications built with MVAPICH2, including Quantum ESPRESSO 7.4.1, HDF5, and OpenFOAM, may experience poor performance on Cardinal. We suspect the issue is related to Cardinal's newer network devices or drivers. Since MVAPICH2 is no longer supported, we recommend switching to MVAPICH 3.0 or another MPI implementation to maintain performance and stability in your work.
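In an existing batch script, the switch usually amounts to swapping the MPI module and rebuilding the application against it. A minimal sketch, assuming the module names mvapich2 and mvapich/3.0 (check module avail for the exact names on your system):

    #!/bin/bash
    #SBATCH --nodes=2

    # Swap the unsupported MVAPICH2 module for MVAPICH 3.0
    module unload mvapich2
    module load mvapich/3.0

    # Applications must be rebuilt (or a rebuilt module loaded) against the new MPI
    srun ./my_app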

MPI fails with UCX 1.18

After the downtime on August 19, 2025, users may encounter UCX errors when running multi-node jobs with intel-oneapi-mpi/2021.10.0, mvapich/3.0, or openmpi/5.0.2, for example:

UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
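One way to investigate is to check which transports UCX actually detects on the compute nodes and to enable more verbose transport-selection logging. This is a diagnostic sketch only; ucx_info and the UCX_LOG_LEVEL / UCX_TLS variables are standard UCX facilities, but the transport values shown are hardware dependent and adjusting UCX_TLS may not resolve the underlying failure.

    # Show the transports and devices UCX detects on a compute node
    ucx_info -d | grep -i transport

    # Ask UCX to log more detail about transport selection in the failing job
    export UCX_LOG_LEVEL=info

    # Optionally restrict or extend the transport list (values are hardware dependent)
    export UCX_TLS=self,sm,rc,ud
    srun -N 2 -n 8 ./my_app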

Poor performance with hybrid MPI+OpenMP jobs and more than 4 MPI tasks on multiple nodes

RELION versions prior to 5 may exhibit suboptimal performance in hybrid MPI+OpenMP jobs when the number of MPI tasks exceeds four across multiple nodes.

Workaround

If possible, limit the number of MPI tasks to four or fewer for optimal performance. Alternatively, consider upgrading to RELION version 5 or later, as newer releases may include optimizations that resolve this performance issue.
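A sketch of a Slurm job layout that follows the first workaround, keeping four MPI tasks and moving the remaining parallelism into threads per task. The resource numbers and module name are illustrative; RELION's --j flag sets the number of threads per MPI task.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=4            # keep MPI tasks at four or fewer
    #SBATCH --cpus-per-task=12    # use threads for the remaining parallelism

    module load relion            # module name is site specific

    # --j sets the number of threads per MPI task; other RELION arguments omitted
    srun relion_refine_mpi --j "$SLURM_CPUS_PER_TASK"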
