This document serves as a knowledge base for properly managing and diagnosing threading issues in user jobs. It focuses on OpenMP, Intel Math Kernel Library (MKL), and common thread-related misuse at OSC.
Understanding Threading with OpenMP and MKL
Intel MKL is widely used in HPC for linear algebra, FFTs, and statistical routines. MKL is multithreaded by default, which can significantly improve performance but only when correctly configured.
Key Environment Variables
| Variable | Applies To | Description |
|---|---|---|
| `OMP_NUM_THREADS` | All OpenMP programs | Sets the number of threads for OpenMP. Recognized by all compilers. |
| `MKL_NUM_THREADS` | Intel MKL libraries | Sets the number of threads for MKL. Takes precedence over `OMP_NUM_THREADS` for MKL routines. |
Behavior Summary
- If neither variable is set, MKL defaults to using all available cores, subject to Slurm cgroup limits.
- If both are set, MKL uses `MKL_NUM_THREADS` for its internal operations, even if `OMP_NUM_THREADS` is higher.
- Compiler overrides: the thread count may also be overridden by compiler-specific variables (`KMP_NUM_THREADS`, etc.).
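As a quick illustration of this precedence, the sketch below sets both variables before launching a program; `./mkl_app` is a placeholder for any MKL-linked executable and is not taken from this document. With these settings, OpenMP parallel regions in the program use 8 threads while MKL's internal routines stay single-threaded.

```bash
# Illustrative sketch only; ./mkl_app stands in for any MKL-linked executable.
export OMP_NUM_THREADS=8   # OpenMP parallel regions: 8 threads
export MKL_NUM_THREADS=1   # MKL routines: 1 thread (wins over OMP_NUM_THREADS inside MKL)
./mkl_app
```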
Common Thread Misuse Patterns
Users often run a program in parallel using MPI or other approaches without realizing that the program was built with MKL or OpenMP enabled. While they may request sufficient resources for their primary parallelization method, MKL threading can still be automatically activated (as described above), leading to CPU oversubscription and performance degradation.
For example, consider an MPI job that requests 8 CPUs:
```bash
#!/bin/bash
#SBATCH --ntasks-per-node=8

srun /path/to/mpi/program
```
Without properly setting `OMP_NUM_THREADS` or `MKL_NUM_THREADS`, each MPI rank may spawn 8 threads by default. This results in a total of 64 threads (8 threads × 8 ranks), which exceeds the allocated CPU resources. Such oversubscription can severely degrade performance, interfere with other users' jobs on the same node, and in extreme cases, even crash the node.
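One way to confirm whether this is happening is to count the threads each rank has actually spawned while the job is running. The snippet below is a minimal sketch, run on a compute node allocated to the job; `mpi_program` is a placeholder for your executable's name, not a name used elsewhere in this document.

```bash
# Minimal sketch: report the thread count (NLWP) of each process matching the name.
# "mpi_program" is a placeholder; substitute your executable's name.
for pid in $(pgrep -f mpi_program); do
    echo "PID $pid: $(ps -o nlwp= -p "$pid") threads"
done
```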
Best Practice
- Set `MKL_NUM_THREADS=1` unless performance tuning suggests otherwise.
- For a hybrid OpenMP + MPI program, use `--cpus-per-task=N` and set `OMP_NUM_THREADS=N` accordingly.
- Always validate effective thread usage: MPI ranks × threads per rank ≤ allocated CPU cores.
A properly configured version of the job script looks like this:
```bash
#!/bin/bash
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8

export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun /path/to/mpi/program
```
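If the program has no intended OpenMP threading of its own and all parallelism comes from MPI, a minimal variant of the same script, following the single-thread recommendation above, pins each rank to one thread:

```bash
#!/bin/bash
#SBATCH --ntasks-per-node=8

# Pure-MPI case: keep every rank single-threaded.
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1

srun /path/to/mpi/program
```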