Slurm Migration Issues

This page documents the known issues for migrating jobs from Torque to Slurm.


Please be aware that $PBS_NODEFILE is a file, while $SLURM_JOB_NODELIST is a string variable containing a compact host-range expression (for example, p[0511-0513]).

The Slurm analog of cat $PBS_NODEFILE is srun hostname | sort -n
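Inside a job, scontrol show hostnames "$SLURM_JOB_NODELIST" also prints one hostname per line and can be redirected to a file to recreate a PBS-style nodefile. As an illustration only, the sketch below expands a simple single-range nodelist the same way; it assumes a hypothetical nodelist with one bracketed range and no comma-separated lists:

```shell
#!/bin/bash
# Illustrative only: expand a simple nodelist like "p[0511-0513]".
# In a real job, prefer: scontrol show hostnames "$SLURM_JOB_NODELIST"
expand_nodelist() {
  local list=$1
  if [[ $list == *\[*-*\]* ]]; then
    local prefix=${list%%\[*}                   # e.g. "p"
    local range=${list#*\[}; range=${range%\]}  # e.g. "0511-0513"
    local lo=${range%-*} hi=${range#*-}
    local i
    for (( i=10#$lo; i<=10#$hi; i++ )); do
      printf '%s%0*d\n' "$prefix" "${#lo}" "$i"  # keep zero padding
    done
  else
    printf '%s\n' "$list"    # single host, nothing to expand
  fi
}

expand_nodelist "p[0511-0513]"
```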

Environment variables cannot be used in #SBATCH directives in the job script

A job script job.txt containing #SBATCH --output=$HOME/jobtest.out will not work in Slurm, because $HOME is not expanded inside #SBATCH directives. Pass the option on the command line instead:

sbatch --output=$HOME/jobtest.out job.txt 
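Alternatively, Slurm's own filename patterns are expanded by Slurm itself, so a directive like the following does work inside the job script (the filename stem here is just an example):

```shell
#SBATCH --output=jobtest.%j.out    # %j expands to the job ID, %x to the job name
```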

Using mpiexec with Intel MPI

Intel MPI on Pitzer is configured to support the PMI and Hydra process managers. We recommend using srun as the MPI program launcher. If you prefer using mpiexec on Pitzer, you might experience an MPI initialization error or see the message:

MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found

Please run unset I_MPI_PMI_LIBRARY in the job before running MPI programs to resolve the issue.
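In a job script this might look like the following sketch; the module name and the executable are placeholders, and the actual module name may differ on your system:

```shell
module load intelmpi           # assumption: module name may differ per site
unset I_MPI_PMI_LIBRARY        # let mpiexec use the Hydra process manager
mpiexec ./my_mpi_program       # hypothetical MPI executable
```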

Using --ntasks-per-node and --mem options together

Right now, jobs using --ntasks-per-node and --mem together are running into a bug: if --mem divided by MaxMemPerCPU is greater than --ntasks-per-node, the job is not considered schedulable on any partition where the MaxMemPerCPU limit applies. MaxMemPerCPU is set on all partitions to the usable memory of a node type divided by its core count. One consequence we have observed: jobs requesting 1 GPU with --ntasks-per-node=4 and --mem=32G incorrectly run only on the quad-GPU nodes, even when other GPU nodes are available.
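As a sketch of the arithmetic involved, using a hypothetical MaxMemPerCPU of 8 GB (the real value varies by partition): Slurm derives an implied CPU count from --mem, and the bug appears when that count exceeds --ntasks-per-node:

```shell
# Hypothetical values for illustration; MaxMemPerCPU varies by partition.
mem_gb=32
max_mem_per_cpu_gb=8
ntasks_per_node=4

# CPUs implied by the memory request (rounded up)
implied_cpus=$(( (mem_gb + max_mem_per_cpu_gb - 1) / max_mem_per_cpu_gb ))
echo "implied CPUs: $implied_cpus"

if (( implied_cpus > ntasks_per_node )); then
  echo "bug territory: job may be deemed unschedulable on this partition"
else
  echo "ok: memory request fits within --ntasks-per-node"
fi
```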

Executables with a certain MPI library using the Slurm PMI2 interface


Stopping mpi4py Python processes during an interactive job session, working only from a login node:

$ salloc -t 15:00 --ntasks-per-node=4
salloc: Pending job allocation 20822
salloc: job 20822 queued and waiting for resources
salloc: job 20822 has been allocated resources
salloc: Granted job allocation 20822
salloc: Waiting for resource configuration
salloc: Nodes p0511 are ready for job
# don't login to one of the allocated nodes, stay on the login node
$ module load python/3.7-2019.10
$ source activate testing
(testing) $ srun --quit-on-interrupt python
# enter <ctrl-c>
^Csrun: sending Ctrl-C to job 20822.5
Hello World (from process 0)
process 0 is sleeping...
Hello World (from process 2)
process 2 is sleeping...
Hello World (from process 3)
process 3 is sleeping...
Hello World (from process 1)
process 1 is sleeping...
Traceback (most recent call last):
File "", line 16, in <module>
Traceback (most recent call last):
File "", line 16, in <module>
Traceback (most recent call last):
File "", line 16, in <module>
Traceback (most recent call last):
File "", line 16, in <module>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 20822.5 ON p0511 CANCELLED AT 2020-09-04T10:13:44 ***
# still in the job and able to restart the processes

pbsdcp with Slurm

pbsdcp works correctly with Slurm, but wildcards must be passed without quotation marks. In Torque/Moab, you could quote the wildcard, for example:

pbsdcp -g '*' {dest_dir} 

With Slurm, omit the quotation marks:

pbsdcp -g * {dest_dir} 

If you like, you can use sbcast and/or sgather instead of pbsdcp as well.
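With sbcast and sgather, the equivalent staging might look like the following sketch inside a job script; the file names are placeholders:

```shell
# Stage an input file to node-local storage on every allocated node
sbcast data.in $TMPDIR/data.in

# ... run the job ...

# Collect per-node output files back; sgather appends the source node's
# hostname to each destination file name
sgather $TMPDIR/out.dat $SLURM_SUBMIT_DIR/out.dat
```

Note that sbcast and sgather operate on single files, whereas pbsdcp can copy multiple files at once.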

Signal handling in slurm

The script below needs a wait command for the user-defined signal USR1 to be received by the process.

The sleep process is backgrounded with &, and the script then calls wait, so that the bash shell can receive signals and execute the trap commands instead of ignoring them while the sleep process runs in the foreground.

#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60

function my_handler() {
  echo "Catching signal"
  touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
}

trap my_handler USR1
trap my_handler TERM

sleep 3600 &
wait
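The pattern can also be tried outside Slurm. The hypothetical stand-alone demo below sends USR1 to the shell itself and shows that the trap fires while wait is blocking:

```shell
#!/bin/bash
# Stand-alone demo of the trap + background + wait pattern (no Slurm needed).
caught=0
my_handler() {
  echo "Catching signal"
  caught=1
}
trap my_handler USR1

sleep 30 &                      # long-running work, backgrounded
work_pid=$!

( sleep 1; kill -USR1 $$ ) &    # deliver USR1 to this shell shortly;
                                # $$ in a subshell is still the parent's PID

wait $work_pid || true          # trap runs here; wait returns >128 once signaled
echo "back from wait"
kill $work_pid 2>/dev/null || true   # clean up the leftover sleep
```

Without the & and wait, a foreground sleep would delay the trap until the sleep finished.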


'mail' does not work; use 'sendmail'

The 'mail' command does not work in a batch job; use 'sendmail' instead, for example:

sendmail {your email address} <<EOF
subject: Output path from $SLURM_JOB_ID

{your message body}
EOF
