OpenMPI 4 and NVHPC MPI Compatibility Issues with SLURM HWLOC

Resolution: Unresolved

A pure MPI application launched with mpirun or mpiexec using more ranks than there are NUMA nodes may fail with an error similar to the following:

--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        c0313
  Application name:  bin/placement_mpi
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "96"
  Location:          rtc_hwloc.c:382 
-------------------------------------------------------------------------- 

This occurs because OpenMPI attempts to bind to a CPU that SLURM has already restricted.
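To see which CPUs the SLURM cgroup actually exposes to the job, you can inspect the process affinity mask. The sketch below is a hypothetical, Linux-specific standalone diagnostic (not part of the application above); if the CPU index reported in the hwloc error (96 in the bitmap above) is absent from the printed set, the bind request falls outside what SLURM allows.

  /* Hypothetical diagnostic: print the CPUs this process is allowed to run
   * on under the current SLURM cgroup.  Compile with any C compiler. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      cpu_set_t mask;
      if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
          perror("sched_getaffinity");
          return 1;
      }
      printf("PID %ld may run on CPUs:", (long)getpid());
      for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
          if (CPU_ISSET(cpu, &mask))
              printf(" %d", cpu);
      printf("\n");
      return 0;
  }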

Additionally, an MPI+OpenMP application launched with mpirun or mpiexec may bind multiple MPI ranks to the same socket or place multiple OpenMP threads on the same CPU core, leading to performance degradation.
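A small hybrid test program makes such misplacement visible by reporting where each rank and thread actually runs. The following is a minimal sketch, assuming a Linux system and an MPI compiler wrapper with OpenMP enabled (the name placement_check is a placeholder); several ranks reporting CPUs on the same socket, or two threads reporting the same CPU, indicates the binding problem described above.

  /* Hypothetical placement check: each OpenMP thread of each MPI rank
   * prints the CPU it is currently running on.
   * Build with, e.g., mpicc -fopenmp placement_check.c -o placement_check
   * (flags differ for the NVHPC compilers). */
  #define _GNU_SOURCE
  #include <mpi.h>
  #include <omp.h>
  #include <sched.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, namelen;
      char host[MPI_MAX_PROCESSOR_NAME];
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &namelen);

      #pragma omp parallel
      {
          /* sched_getcpu() returns the CPU this thread is executing on. */
          printf("host %s rank %d thread %d on cpu %d\n",
                 host, rank, omp_get_thread_num(), sched_getcpu());
      }

      MPI_Finalize();
      return 0;
  }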

Cause of the issue

We believe these issues stem from the HWLOC version shipped with OpenMPI or NVHPC being incompatible with the SLURM cgroup plugin or the system-wide HWLOC, as reported in this issue.

Affected versions

  • OpenMPI: all openmpi/4.1.6 builds on Cardinal
  • NVHPC: nvhpc/24.11 and nvhpc/25.1 on Ascend and Cardinal

Workaround

To avoid these issues, launch MPI applications with srun instead of mpirun or mpiexec, as in the example below.
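For example, inside a batch job or interactive allocation, the application from the error above could be launched as srun ./bin/placement_mpi (with -n and related options matching the resources requested for the job), so that SLURM itself performs the process binding rather than OpenMPI's bundled HWLOC. The exact options depend on your job's resource request; consult your site's SLURM documentation.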