Thread Usage Best Practices

This document serves as a knowledge base for properly managing and diagnosing threading issues in user jobs. It focuses on OpenMP, Intel Math Kernel Library (MKL), and common thread-related misuse at OSC.

Understanding Threading with OpenMP and MKL

Intel MKL is widely used in HPC for linear algebra, FFTs, and statistical routines. MKL is multithreaded by default, which can significantly improve performance, but only when correctly configured.

Key Environment Variables

Variable          Applies To            Description
----------------  --------------------  ------------------------------------------------------------
OMP_NUM_THREADS   All OpenMP programs   Sets the number of threads for OpenMP. Recognized by all compilers.
MKL_NUM_THREADS   Intel MKL libraries   Sets the number of threads for MKL. Takes precedence over OMP_NUM_THREADS for MKL routines.

Behavior Summary

  • MKL is subject to Slurm cgroup limits and defaults to all cores available to the job if neither variable is set.
  • If both are set, MKL uses MKL_NUM_THREADS for its internal operations, even if OMP_NUM_THREADS is higher.
  • Compiler overrides: the thread count may also be overridden by compiler-specific variables (such as Intel's KMP_* settings).
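The selection order above can be sketched as a small shell function. This is a simplified model, not MKL's actual implementation, and `effective_mkl_threads` is a hypothetical helper name:

```shell
# Simplified model of how MKL picks its thread count:
# MKL_NUM_THREADS first, then OMP_NUM_THREADS, then all visible cores.
effective_mkl_threads() {
    if [ -n "$MKL_NUM_THREADS" ]; then
        echo "$MKL_NUM_THREADS"
    elif [ -n "$OMP_NUM_THREADS" ]; then
        echo "$OMP_NUM_THREADS"
    else
        nproc   # nproc respects the Slurm cgroup, unlike /proc/cpuinfo
    fi
}

MKL_NUM_THREADS=2 OMP_NUM_THREADS=8 effective_mkl_threads   # prints 2
```

Note that MKL_NUM_THREADS wins even though OMP_NUM_THREADS asks for more threads, matching the precedence rule in the table above.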

Common Thread Misuse Patterns

Users often run programs in parallel using MPI or other approaches without realizing that the program was built with MKL threading or OpenMP enabled. While they may request sufficient resources for their primary parallelization method, MKL threading can still be automatically activated (as described above), leading to CPU oversubscription and performance degradation.

Commonly affected applications at OSC include R, LAMMPS, and GROMACS.

Example: Uncontrolled Threading in an MPI Job

Consider an MPI job that requests 8 CPUs:

#!/bin/bash
#SBATCH --ntasks-per-node=8

srun /path/to/mpi/program

Without OMP_NUM_THREADS or MKL_NUM_THREADS set, each MPI rank may spawn one thread per core visible to the job (here, 8), for a total of 64 threads (8 ranks × 8 threads each), far exceeding the allocated CPU resources. Such oversubscription can severely degrade performance, interfere with other users' jobs on the same node, and in extreme cases even crash the node.

Best Practice

  • Set MKL_NUM_THREADS=1 unless performance tuning suggests otherwise.
  • For a hybrid OpenMP + MPI program, use --cpus-per-task=N and set OMP_NUM_THREADS=N accordingly.
  • If you are unsure whether OpenMP is needed, set OMP_NUM_THREADS=1 to disable threading safely.
  • Always validate effective thread usage: MPI ranks × threads per rank ≤ allocated CPU cores.
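The last check can be automated at the top of a job script. The sketch below uses a hypothetical helper name, `check_threads`; inside a real job you would pass Slurm's own values, as shown in the comment:

```shell
# Fail fast if ranks × threads would oversubscribe the allocation.
# Usage: check_threads <ranks> <threads_per_rank> <allocated_cores>
check_threads() {
    total=$(( $1 * $2 ))
    if [ "$total" -gt "$3" ]; then
        echo "ERROR: $total threads > $3 allocated cores" >&2
        return 1
    fi
    echo "OK: $total threads on $3 cores"
}

# In a job script, pass Slurm's values, e.g.:
#   check_threads "$SLURM_NTASKS" "${OMP_NUM_THREADS:-1}" "$SLURM_CPUS_ON_NODE"
check_threads 8 8 64           # prints: OK: 64 threads on 64 cores
check_threads 8 8 8 || true    # oversubscribed: prints an ERROR on stderr
```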

Example: Properly Configured Job Script (8 OpenMP Threads per MPI Rank)

#!/bin/bash
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8

export MKL_NUM_THREADS=1 
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun /path/to/mpi/program

Example: If OpenMP Threading Is Not Needed

#!/bin/bash
#SBATCH --ntasks-per-node=8

export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1
srun /path/to/mpi/program

Note on Implicit Threading via Libraries

There are several cases where the main program is not explicitly built with MKL threading or OpenMP enabled, but its dependent libraries are. A common example is a Python program that uses NumPy. Certain NumPy operations, such as np.dot, can leverage MKL or OpenMP internally and spawn multiple threads.

In such cases, if you are unsure whether threading is needed, it is safest to follow the example above and explicitly set:

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

This ensures controlled thread usage and prevents unexpected oversubscription.
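MKL is not the only threaded BLAS backend: NumPy builds linked against OpenBLAS read a different variable. As a sketch (assuming a Linux environment; `numpy.show_config()` is NumPy's standard introspection call, and the command is skipped harmlessly if NumPy is not installed):

```shell
# Check which BLAS backend NumPy was built against:
python -c "import numpy; numpy.show_config()" 2>/dev/null || true

# OpenBLAS-backed builds read OPENBLAS_NUM_THREADS instead of MKL_NUM_THREADS:
export OPENBLAS_NUM_THREADS=1
```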

Uncommon Thread Misuse Cases

Some programs not designed for HPC environments may spawn multiple subprocesses or determine the number of threads by directly reading system information from /proc/cpuinfo, ignoring Slurm-imposed resource limits. In such cases, the standard thread control methods described above may not work, as the internal settings override user-defined environment variables.

Sometimes, these programs offer command-line options or configuration parameters to control threading. Users should consult the program's documentation and explicitly set the number of threads as appropriate for their job's allocated resources.
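The mismatch is easy to see from inside a job: /proc/cpuinfo reports the physical host, while nproc honors the cgroup. The two numbers differ only when Slurm restricts the job to a subset of the node's cores:

```shell
# Core count a cgroup-unaware program might read (the whole host):
grep -c '^processor' /proc/cpuinfo

# Core count actually available to this job (cgroup-aware):
nproc
```

If the first number is larger than the second inside your job, any program that reads /proc/cpuinfo directly will oversubscribe your allocation unless its thread count is set explicitly.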
