This document serves as a knowledge base for properly managing and diagnosing threading issues in user jobs. It focuses on OpenMP, Intel Math Kernel Library (MKL), and common thread-related misuse at OSC.
Understanding Threading with OpenMP and MKL
Intel MKL is widely used in HPC for linear algebra, FFTs, and statistical routines. MKL is multithreaded by default, which can significantly improve performance but only when correctly configured.
Key Environment Variables
| Variable | Applies To | Description |
|---|---|---|
| `OMP_NUM_THREADS` | All OpenMP programs | Sets the number of threads for OpenMP. Recognized by all compilers. |
| `MKL_NUM_THREADS` | Intel MKL libraries | Sets the number of threads for MKL. Takes precedence over `OMP_NUM_THREADS`. |
Behavior Summary
- MKL is subject to Slurm cgroup limits and defaults to all available cores if neither variable is set.
- If both are set, MKL uses `MKL_NUM_THREADS` for its internal operations, even if `OMP_NUM_THREADS` is higher.
- Compiler overrides: the thread count may also be overridden by compiler-specific variables (`KMP_NUM_THREADS`, etc.).
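The precedence rules above can be sketched in a few lines of Python. This is a simplified model of the documented behavior, not MKL's actual implementation, and the helper name `mkl_thread_count` is our own:

```python
import os

def mkl_thread_count(env):
    """Simplified model of the precedence rules above (not MKL's real code):
    MKL_NUM_THREADS wins if set, then OMP_NUM_THREADS, then all visible cores."""
    if "MKL_NUM_THREADS" in env:
        return int(env["MKL_NUM_THREADS"])
    if "OMP_NUM_THREADS" in env:
        return int(env["OMP_NUM_THREADS"])
    return os.cpu_count()  # default: every core the process can see

# Both variables set: MKL_NUM_THREADS takes precedence even though it is lower.
print(mkl_thread_count({"OMP_NUM_THREADS": "8", "MKL_NUM_THREADS": "2"}))  # 2
```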
Common Thread Misuse Patterns
Users often run programs in parallel using MPI or other approaches without realizing that the program was built with MKL threading or OpenMP enabled. While they may request sufficient resources for their primary parallelization method, MKL threading can still be automatically activated (as described above), leading to CPU oversubscription and performance degradation.
Commonly affected applications at OSC include R, LAMMPS, and GROMACS.
Example: Uncontrolled Threading in an MPI Job
Consider an MPI job that requests 8 tasks (one CPU core per task):
```bash
#!/bin/bash
#SBATCH --ntasks-per-node=8

srun /path/to/mpi/program
```
Without properly setting `OMP_NUM_THREADS` or `MKL_NUM_THREADS`, each MPI rank may spawn 8 threads by default. This results in a total of 64 threads (8 threads × 8 ranks), which exceeds the allocated CPU resources. Such oversubscription can severely degrade performance, interfere with other users' jobs on the same node, and in extreme cases even crash the node.
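One way to confirm this on a running job is to check each rank's thread count. On Linux, the count is exposed in `/proc/<pid>/status` (the same source `ps` uses for its `NLWP` column). A minimal sketch, here inspecting the current process:

```python
def thread_count(pid="self"):
    """Return a process's thread count, read from /proc (Linux-only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])

# A rank reporting far more threads than --cpus-per-task indicates oversubscription.
print(thread_count())
```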
Best Practice
- Set `MKL_NUM_THREADS=1` unless performance tuning suggests otherwise.
- For a hybrid OpenMP + MPI program, use `--cpus-per-task=N` and set `OMP_NUM_THREADS=N` accordingly.
- If you are unsure whether OpenMP is needed, set `OMP_NUM_THREADS=1` to disable threading safely.
- Always validate effective thread usage: MPI ranks × threads per rank ≤ allocated CPU cores.
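The last rule in the list translates directly into a quick sanity check (the helper name is our own):

```python
def oversubscribed(ranks, threads_per_rank, allocated_cores):
    """True when the job would run more threads than it has cores."""
    return ranks * threads_per_rank > allocated_cores

# The misconfigured example above: 8 ranks x 8 threads on only 8 cores.
print(oversubscribed(8, 8, 8))    # True (64 threads on 8 cores)
# The hybrid layout with --cpus-per-task=8: 8 ranks x 8 threads on 64 cores.
print(oversubscribed(8, 8, 64))   # False
```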
Example: Properly Configured Job Script (8 OpenMP Threads per MPI Rank)
```bash
#!/bin/bash
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8

export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun /path/to/mpi/program
```
Example: If OpenMP Threading Is Not Needed
```bash
#!/bin/bash
#SBATCH --ntasks-per-node=8

export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1

srun /path/to/mpi/program
```
Note on Implicit Threading via Libraries
There are several cases where the main program is not explicitly built with MKL threading or OpenMP enabled, but its dependent libraries are. A common example is a Python program that uses NumPy. Certain NumPy operations, such as `np.dot`, can leverage MKL or OpenMP internally and spawn multiple threads.
In such cases, if you are unsure whether threading is needed, it is safest to follow the example above and explicitly set:
```bash
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
```
This ensures controlled thread usage and prevents unexpected oversubscription.
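When the caps cannot be set in the job script, they can also be set from inside Python, as long as that happens before NumPy is imported, since the threading backend reads them at load time. A sketch, assuming a NumPy build linked against a threaded BLAS such as MKL:

```python
import os

# Must come before "import numpy": the threading backend reads these on load.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

a = np.ones((100, 100))
b = np.ones((100, 100))
c = np.dot(a, b)  # runs single-threaded under the caps above
print(c.shape)    # (100, 100)
```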
Uncommon Thread Misuse Cases
Some programs not designed for HPC environments may spawn multiple subprocesses or determine the number of threads by directly reading system information from /proc/cpuinfo, ignoring Slurm-imposed resource limits. In such cases, the standard thread control methods described above may not work, as the internal settings override user-defined environment variables.
Sometimes, these programs offer command-line options or configuration parameters to control threading. Users should consult the program's documentation and explicitly set the number of threads as appropriate for their job's allocated resources.
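The gap between what the hardware reports and what the job is allowed to use is easy to demonstrate from Python: `os.cpu_count()` returns every core on the node (what naive `/proc/cpuinfo` parsing sees), while `os.sched_getaffinity(0)` reflects the CPUs the process may actually run on, which is what Slurm constrains:

```python
import os

# Every core the hardware reports; ignores Slurm/cgroup limits.
print(os.cpu_count())

# Cores this process may actually use (Linux-only); inside a Slurm job this
# reflects the allocation, and is the safer basis for choosing a thread count.
print(len(os.sched_getaffinity(0)))
```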