A pure MPI application using mpirun or mpiexec with more ranks than the number of NUMA nodes may encounter an error similar to the following:
```
--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.

  Local host:        c0313
  Application name:  bin/placement_mpi
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "96"
  Location:          rtc_hwloc.c:382
--------------------------------------------------------------------------
```
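For context, this failure typically appears in an ordinary batch job that launches the application with mpirun. A minimal sketch of such a job script is shown below; the rank count and module version are illustrative placeholders rather than values from the original report (only bin/placement_mpi appears in the error message above).

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96   # illustrative: more ranks than NUMA nodes on the node

module load openmpi/4.1.6      # one of the affected modules

# Launching with mpirun can trigger the hwloc_set_cpubind error shown above
mpirun -np $SLURM_NTASKS ./bin/placement_mpi
```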
This occurs because OpenMPI attempts to bind a process to a CPU that SLURM has already restricted.
Additionally, an MPI+OpenMP application using mpirun or mpiexec may either bind multiple MPI ranks to the same socket or assign multiple threads to the same CPU core, leading to performance degradation.
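To confirm whether ranks or threads are being misplaced, you can ask the launcher and the OpenMP runtime to report their bindings. The sketch below uses standard Open MPI and OpenMP 5.0 options; the executable name and counts are placeholders.

```bash
# Open MPI: report the CPU set each rank is bound to at launch
# OpenMP 5.0+: have each thread print its affinity at runtime
export OMP_NUM_THREADS=4
export OMP_DISPLAY_AFFINITY=true
mpirun --report-bindings -np 4 ./hybrid_app
```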
Cause of the issue
We believe these issues stem from the HWLOC version shipped with OpenMPI or NVHPC being incompatible with the SLURM cgroup plugin or the system-wide HWLOC, as reported in this issue.
Affected versions
- OpenMPI: all openmpi/4.1.6 builds on Cardinal
- NVHPC: nvhpc/24.11 and nvhpc/25.1 on Ascend and Cardinal
Workaround
To avoid these issues, use srun to launch MPI applications instead of mpirun or mpiexec.
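For example, a hybrid MPI+OpenMP job could be launched with srun as in the sketch below. The node, task, and thread counts are illustrative, and depending on how the MPI library was built you may also need srun's --mpi option (for example --mpi=pmix).

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12

module load openmpi/4.1.6

# Let srun place and bind the ranks instead of mpirun/mpiexec
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores ./bin/placement_mpi
```

Because srun binds processes within the CPU set SLURM has already allocated to the job, it avoids the binding conflict described above.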