A partial-node MPI job may fail to start when using mpiexec from intelmpi/2019.3 or intelmpi/2019.7, with error messages like:
[mpiexec@o0439.ten.osc.edu] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:532): downstream from host o0439 was killed by signal 11 (Segmentation fault)
[mpiexec@o0439.ten.osc.edu] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2114): assert (exitcodes != NULL) failed
/var/spool/torque/mom_priv/jobs/11510761.owens-batch.ten.osc.edu.SC: line 30: 11728 Segmentation fault
/var/spool/slurmd/job00884/slurm_script: line 24: 3180 Segmentation fault (core dumped)
If you are using Slurm, make sure the job requests its CPU allocation with
#SBATCH --ntasks=N
instead of
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=N
A minimal example batch script follows.
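As an illustration, a partial-node batch script using the recommended allocation style might look like the sketch below. The job name, walltime, task count, and the executable name ./a.out are placeholders rather than values from the original report; adjust them for your own job, and note that the module load line assumes the matching compiler environment is already available.

#!/bin/bash
#SBATCH --job-name=partial_node_mpi   # placeholder job name
#SBATCH --ntasks=4                    # request 4 CPUs via --ntasks, not --nodes + --ntasks-per-node
#SBATCH --time=00:10:00               # placeholder walltime

# Load one of the affected Intel MPI versions mentioned above
module load intelmpi/2019.7

# Launch the MPI program; mpiexec typically takes the rank count from the Slurm allocation
mpiexec ./a.out

The key point is that the CPUs are requested with --ntasks alone, rather than the --nodes=1 plus --ntasks-per-node combination reported to trigger the segmentation fault.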