Thread Usage Best Practices

This document serves as a knowledge base for properly managing and diagnosing threading issues in user jobs. It focuses on OpenMP, Intel Math Kernel Library (MKL), and common thread-related misuse at OSC.

Understanding Threading with OpenMP and MKL

Intel MKL is widely used in HPC for linear algebra, FFTs, and statistical routines. MKL is multithreaded by default, which can significantly improve performance, but only when it is correctly configured.

Key Environment Variables

Variable         | Applies To           | Description
OMP_NUM_THREADS  | All OpenMP programs  | Sets the number of threads for OpenMP. Recognized by all compilers.
MKL_NUM_THREADS  | Intel MKL libraries  | Sets the number of threads for MKL. Takes precedence over OMP_NUM_THREADS for MKL routines.
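
For example, both variables can be exported in a job script or an interactive shell before launching the program (a minimal sketch; the program path and thread counts are placeholders):

export OMP_NUM_THREADS=4   # the program's own OpenMP regions use 4 threads
export MKL_NUM_THREADS=1   # MKL routines stay single-threaded, regardless of OMP_NUM_THREADS
/path/to/program           # placeholder for your executable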

Behavior Summary

  • MKL is subject to Slurm cgroup limits and defaults to all cores available to the job if neither variable is set (see the check after this list).
  • If both variables are set, MKL uses MKL_NUM_THREADS for its internal operations, even if OMP_NUM_THREADS is higher.
  • Compiler overrides: Thread count may be overridden by compiler-specific variables (KMP_NUM_THREADS, etc.).
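
One way to see what "all available cores" means for a particular job is to check how many CPUs are visible inside the allocation, for example with nproc, which honors the job's cgroup/affinity mask:

# Run inside an interactive job or batch script: nproc reports the cores
# visible to the job's cgroup, i.e. the count MKL would default to
nproc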

Common Thread Misuse Patterns

Users often run a program in parallel with MPI or another approach without realizing that the program was built with MKL or OpenMP enabled. While they may request sufficient resources for their primary parallelization method, MKL threading can still be activated automatically (as described above), leading to CPU oversubscription and performance degradation.
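
A quick way to check whether an executable was built against MKL or an OpenMP runtime is to inspect its shared-library dependencies (a rough heuristic; statically linked binaries will not show these entries, and the path is a placeholder):

# Look for MKL (libmkl_*) or OpenMP runtimes (libiomp5, libgomp) among the
# program's shared-library dependencies
ldd /path/to/program | grep -Ei 'mkl|iomp|gomp'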

For example, consider an MPI job that requests 8 CPUs:

#!/bin/bash
#SBATCH --ntasks-per-node=8

srun /path/to/mpi/program

Without OMP_NUM_THREADS or MKL_NUM_THREADS set, each MPI rank may spawn 8 threads by default (one per core visible in the job's cgroup). This results in 64 threads in total (8 ranks × 8 threads) competing for the 8 allocated cores. Such oversubscription can severely degrade performance, interfere with other users' jobs on the same node, and in extreme cases even crash the node.
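
Actual thread usage can be verified on the compute node while the job is running, for example by listing the per-process thread count (the program name below is a placeholder):

# NLWP = number of threads per process; across all ranks the thread total
# should not exceed the allocated core count
ps -u $USER -o pid,nlwp,comm | grep program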

Best Practice

  • Set MKL_NUM_THREADS=1 unless performance tuning suggests otherwise.
  • For a hybrid OpenMP + MPI program, use --cpus-per-task=N and set OMP_NUM_THREADS=N accordingly.
  • Always validate effective thread usage: MPI ranks × threads per rank ≤ allocated CPU cores.

A properly configured job script looks like this:

#!/bin/bash
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8

export MKL_NUM_THREADS=1 
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun /path/to/mpi/program
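
If profiling shows that most of the parallel work happens inside MKL routines, one possible variant (an illustration under that assumption, not a general recommendation; measure before adopting it) is to keep the program's own OpenMP regions serial and give the per-rank cores to MKL instead:

#!/bin/bash
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=1                      # keep the program's own OpenMP regions serial
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK   # let MKL use the cores reserved for each rank

srun /path/to/mpi/program

Either way, the rule from the best-practice list still holds: 8 ranks × 8 threads per rank = 64 threads on 64 allocated cores.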

 
