Slurm Migration Issues

This page documents the known issues for migrating jobs from Torque to Slurm.


Please be aware that $PBS_NODEFILE is a file, while $SLURM_JOB_NODELIST is a string variable containing a compact host-range expression (for example, p[0511-0513]).

The Slurm analog of cat $PBS_NODEFILE is srun hostname | sort -n
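Inside a job, scontrol show hostnames "$SLURM_JOB_NODELIST" also prints one hostname per line and can be redirected to a file to recreate a PBS-style nodefile. As an illustration only, the sketch below expands a simple single-range nodelist the same way; it assumes a hypothetical nodelist with one bracketed range and no comma-separated lists:

```shell
#!/bin/bash
# Illustrative only: expand a simple nodelist like "p[0511-0513]".
# In a real job, prefer: scontrol show hostnames "$SLURM_JOB_NODELIST"
expand_nodelist() {
  local list=$1
  if [[ $list == *\[*-*\]* ]]; then
    local prefix=${list%%\[*}                   # e.g. "p"
    local range=${list#*\[}; range=${range%\]}  # e.g. "0511-0513"
    local lo=${range%-*} hi=${range#*-}
    local i
    for (( i=10#$lo; i<=10#$hi; i++ )); do
      printf '%s%0*d\n' "$prefix" "${#lo}" "$i"  # keep zero padding
    done
  else
    printf '%s\n' "$list"    # single host, nothing to expand
  fi
}

expand_nodelist "p[0511-0513]"
```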

Environment variables cannot be used in #SBATCH directives in the job script

A job script job.txt containing #SBATCH --output=$HOME/jobtest.out will not work in Slurm, because $HOME is not expanded inside #SBATCH directives. Pass the option on the command line instead:

sbatch --output=$HOME/jobtest.out job.txt 
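Alternatively, Slurm's own filename patterns are expanded by Slurm itself, so a directive like the following does work inside the job script (the filename stem here is just an example):

```shell
#SBATCH --output=jobtest.%j.out    # %j expands to the job ID, %x to the job name
```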

Using mpiexec with Intel MPI

Intel MPI on Pitzer is configured to support the PMI and Hydra process managers. We recommend using srun as the MPI program launcher. If you prefer using mpiexec on Pitzer, you might experience an MPI initialization error or see the message:

MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found

Please run unset I_MPI_PMI_LIBRARY in the job before running MPI programs to resolve the issue.
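In a job script this might look like the following sketch; the module name and the executable are placeholders, and the actual module name may differ on your system:

```shell
module load intelmpi           # assumption: module name may differ per site
unset I_MPI_PMI_LIBRARY        # let mpiexec use the Hydra process manager
mpiexec ./my_mpi_program       # hypothetical MPI executable
```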

Using --ntasks-per-node and --mem options together

Right now, jobs using --ntasks-per-node and --mem together are running into a bug: if --mem divided by MaxMemPerCPU is greater than --ntasks-per-node, the job is not considered schedulable on any partition where the MaxMemPerCPU limit applies. MaxMemPerCPU is set on all partitions to the usable memory of a node type divided by its core count. One consequence we have observed: jobs requesting 1 GPU with --ntasks-per-node=4 and --mem=32G incorrectly run only on the quad-GPU nodes, even when other GPU nodes are available.
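As a sketch of the arithmetic involved, using a hypothetical MaxMemPerCPU of 8 GB (the real value varies by partition): Slurm derives an implied CPU count from --mem, and the bug appears when that count exceeds --ntasks-per-node:

```shell
# Hypothetical values for illustration; MaxMemPerCPU varies by partition.
mem_gb=32
max_mem_per_cpu_gb=8
ntasks_per_node=4

# CPUs implied by the memory request (rounded up)
implied_cpus=$(( (mem_gb + max_mem_per_cpu_gb - 1) / max_mem_per_cpu_gb ))
echo "implied CPUs: $implied_cpus"

if (( implied_cpus > ntasks_per_node )); then
  echo "bug territory: job may be deemed unschedulable on this partition"
else
  echo "ok: memory request fits within --ntasks-per-node"
fi
```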

Executables with a certain MPI library using the Slurm PMI2 interface


Stopping mpi4py Python processes during an interactive job session, working only from a login node:

$ salloc -t 15:00 --ntasks-per-node=4
salloc: Pending job allocation 20822
salloc: job 20822 queued and waiting for resources
salloc: job 20822 has been allocated resources
salloc: Granted job allocation 20822
salloc: Waiting for resource configuration
salloc: Nodes p0511 are ready for job
# don't login to one of the allocated nodes, stay on the login node
$ module load python/3.7-2019.10
$ source activate testing
(testing) $ srun --quit-on-interrupt python
# enter <ctrl-c>
^Csrun: sending Ctrl-C to job 20822.5
Hello World (from process 0)
process 0 is sleeping...
Hello World (from process 2)
process 2 is sleeping...
Hello World (from process 3)
process 3 is sleeping...
Hello World (from process 1)
process 1 is sleeping...
Traceback (most recent call last):
File "", line 16, in <module>
Traceback (most recent call last):
File "", line 16, in <module>
Traceback (most recent call last):
File "", line 16, in <module>
Traceback (most recent call last):
File "", line 16, in <module>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 20822.5 ON p0511 CANCELLED AT 2020-09-04T10:13:44 ***
# still in the job and able to restart the processes

pbsdcp with Slurm

pbsdcp works correctly with Slurm, but wildcards must be passed without quotation marks. In Torque/Moab, you could quote the wildcard, for example:

pbsdcp -g '*' {dest_dir} 

With Slurm, omit the quotation marks:

pbsdcp -g * {dest_dir} 

If you like, you can use sbcast and/or sgather instead of pbsdcp as well.
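With sbcast and sgather, the equivalent staging might look like the following sketch inside a job script; the file names are placeholders:

```shell
# Stage an input file to node-local storage on every allocated node
sbcast data.in $TMPDIR/data.in

# ... run the job ...

# Collect per-node output files back; sgather appends the source node's
# hostname to each destination file name
sgather $TMPDIR/out.dat $SLURM_SUBMIT_DIR/out.dat
```

Note that sbcast and sgather operate on single files, whereas pbsdcp can copy multiple files at once.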

Signal handling in slurm

The script below needs a wait command for the user-defined signal USR1 to be received by the process.

The sleep process is backgrounded with &, and the script then calls wait, so that the bash shell can receive signals and execute the trap commands instead of ignoring them while the sleep process runs in the foreground.

#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60

function my_handler() {
  echo "Catching signal"
  touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
}

trap my_handler USR1
trap my_handler TERM

sleep 3600 &
wait
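The pattern can also be tried outside Slurm. The hypothetical stand-alone demo below sends USR1 to the shell itself and shows that the trap fires while wait is blocking:

```shell
#!/bin/bash
# Stand-alone demo of the trap + background + wait pattern (no Slurm needed).
caught=0
my_handler() {
  echo "Catching signal"
  caught=1
}
trap my_handler USR1

sleep 30 &                      # long-running work, backgrounded
work_pid=$!

( sleep 1; kill -USR1 $$ ) &    # deliver USR1 to this shell shortly;
                                # $$ in a subshell is still the parent's PID

wait $work_pid || true          # trap runs here; wait returns >128 once signaled
echo "back from wait"
kill $work_pid 2>/dev/null || true   # clean up the leftover sleep
```

Without the & and wait, a foreground sleep would delay the trap until the sleep finished.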


'mail' does not work; use 'sendmail'

The 'mail' command does not work in a batch job; use 'sendmail' instead, for example:

sendmail {your email address} <<EOF
subject: Output path from $SLURM_JOB_ID

{your message body}
EOF
