Pitzer Programming Environment

Compilers

C, C++ and Fortran are supported on the Pitzer cluster. Intel, PGI and GNU compiler suites are available. The Intel development tool chain is loaded by default. Compiler commands and recommended options for serial programs are listed in the table below. See also our compilation guide.

The Skylake and Cascade Lake processors that make up Pitzer support the Advanced Vector Extensions (AVX512) instruction set, but you must set the correct compiler flags to take advantage of it. AVX512 has the potential to speed up your code by a factor of 8 or more, depending on the compiler and options you would otherwise use. However, bare in mind that clock speeds decrease as the level of the instruction set increases. So, if your code does not benefit from vectorization it may be beneficial to use a lower instruction set.

In our experience, the Intel compiler usually does the best job of optimizing numerical codes and we recommend that you give it a try if you’ve been using another compiler.

With the Intel compilers, use -xHost and -O2 or higher. With the GNU compilers, use -march=native and -O3. The PGI compilers by default use the highest available instruction set, so no additional flags are necessary.

This advice assumes that you are building and running your code on Owens. The executables will not be portable.  Of course, any highly optimized builds, such as those employing the options above, should be thoroughly validated for correctness.

LANGUAGE INTEL GNU PGI
C icc -O2 -xHost hello.c gcc -O3 -march=native hello.c pgcc -fast hello.c
Fortran 77/90 ifort -O2 -xHost hello.F gfortran -O3 -march=native hello.F pgfortran -fast hello.F
C++ icpc -O2 -xHost hello.cpp g++ -O3 -march=native hello.cpp pgc++ -fast hello.cpp

Parallel Programming

MPI

OSC systems use the MVAPICH2 implementation of the Message Passing Interface (MPI), optimized for the high-speed Infiniband interconnect. MPI is a standard library for performing parallel processing using a distributed-memory model. For more information on building your MPI codes, please visit the MPI Library documentation.

MPI programs are started with the srun command. For example,

#!/bin/bash
#SBATCH --nodes=2

srun [ options ] mpi_prog
Note: the program to be run must either be in your path or have its path specified.

The srun command will normally spawn one MPI process per task requested in a Slurm batch job. Use the -n ntasks and/or --ntasks-per-node=n option to change that behavior. For example,

#!/bin/bash
#SBATCH --nodes=2

# Use the maximum number of CPUs of two nodes
srun ./mpi_prog

# Run 8 processes per node
srun -n 16 --ntasks-per-node=8  ./mpi_prog

The table below shows some commonly used options. Use srun -help for more information.

OPTION COMMENT
-n, --ntasks=ntasks total number of tasks to run
--ntasks-per-node=n number of tasks to invoke on each node
-help Get a list of available options
Note: The information above applies to the MVAPICH2, Intel MPI and OpenMPI installations at OSC. 
Caution: mpiexec or mpiexec.hydra is still supported with Intel MPI and OpenMPI, but it is not fully compatible in our Slurm environment. We recommand using srun in any circumstances.

OpenMP

The Intel, GNU and PGI compilers understand the OpenMP set of directives, which support multithreaded programming. For more information on building OpenMP codes on OSC systems, please visit the OpenMP documentation.

An OpenMP program by default will use a number of threads equal to the number of CPUs requested in a Slurm batch job. To use a different number of threads, set the environment variable OMP_NUM_THREADS. For example,

#!/bin/bash
#SBATCH --ntasks=8

# Run 8 threads
./omp_prog

# Run 4 threads
export OMP_NUM_THREADS=4
./omp_prog

To run a OpenMP job on an exclusive node:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive

export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
./omp_prog

Interactive job only

Please use -c, --cpus-per-task=X instead of -n, --ntasks=X to request an interactive job. Both result in an interactive job with X CPUs available but only the former option automatically assigns the correct number of threads to the OpenMP program. If  the option --ntasks is used only, the OpenMP program will use one thread or all threads will be bound to one CPU core. 

Hybrid (MPI + OpenMP)

An example of running a job for hybrid code:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --constraint=48core

# Run 4 MPI processes on each node and 12 OpenMP threads spawned from a MPI process
export OMP_NUM_THREADS=12
srun -n 8 -c 12 --ntasks-per-node=4 ./hybrid_prog

To run a job across either 40-core or 48-core nodes exclusively:

#!/bin/bash
#SBATCH --nodes=2

# Run 4 MPI processes on each node and the maximum available OpenMP threads spawned from a MPI process 
export OMP_NUM_THREADS=$(($SLURM_CPUS_ON_NODE/4))
srun -n 8 -c $OMP_NUM_THREADS --ntasks-per-node=4 ./hybrid_prog

Tuning Parallel Program Performance: Process/Thread Placement

To get the maximum performance, it is important to make sure that processes/threads are located as close as possible to their data, and as close as possible to each other if they need to work on the same piece of data, with given the arrangement of node, sockets, and cores, with different access to RAM and caches. 

While cache and memory contention between threads/processes are an issue, it is best to use scatter distribution for code. 

Processes and threads are placed differently depending on the computing resources you requste and the compiler and MPI implementation used to compile your code. For the former, see the above examples to learn how to run a job on exclusive nodes. For the latter, this section summarizes the default behavior and how to modify placement.

OpenMP only

For all three compilers (Intel, GNU, PGI), purely threaded codes do not bind to particular CPU cores by default. In other words, it is possible that multiple threads are bound to the same CPU core

The following table describes how to modify the default placements for pure threaded code:

DISTRIBUTION Compact Scatter/Cyclic
DESCRIPTION Place threads as closely as possible on sockets Distribute threads as evenly as possible across sockets
INTEL KMP_AFFINITY=compact KMP_AFFINITY=scatter
GNU OMP_PLACES=sockets[1] OMP_PROC_BIND=spread/close
PGI[2]

MP_BIND=yes
MP_BLIST="$(seq -s, 0 2 47),$(seq -s, 1 2 47)" 

MP_BIND=yes
  1. Threads in the same socket might be bound to the same CPU core.
  2. PGI LLVM-backend (version 19.1 and later) does not support thread/processors affinity on NUMA architecture. To enable this feature, compile threaded code with --Mnollvm to use proprietary backend.

MPI Only

For MPI-only codes, MVAPICH2 first binds as many processes as possible on one socket, then allocates the remaining processes on the second socket so that consecutive tasks are near each other.  Intel MPI and OpenMPI alternately bind processes on socket 1, socket 2, socket 1, socket 2 etc, as cyclic distribution.

For process distribution across nodes, all MPIs first bind as many processes as possible on one node, then allocates the remaining processes on the second node. 

The following table describe how to modify the default placements on a single node for MPI-only code with the command srun:

DISTRIBUTION
(single node)
Compact Scatter/Cyclic
DESCRIPTION Place processs as closely as possible on sockets Distribute process as evenly as possible across sockets
MVAPICH2[1] Default MV2_CPU_BINDING_POLICY=scatter
INTEL MPI srun --cpu-bind="map_cpu:$(seq -s, 0 2 47),$(seq -s, 1 2 47)" Default
OPENMPI srun --cpu-bind="map_cpu:$(seq -s, 0 2 47),$(seq -s, 1 2 47)" Default
  1. MV2_CPU_BINDING_POLICY will not work if MV2_ENABLE_AFFINITY=0 is set.

To distribute processes evenly across nodes, please set SLURM_DISTRIBUTION=cyclic.

Hybrid (MPI + OpenMP)

For Hybrid codes, each MPI process is allocated  OMP_NUM_THREADS cores and the threads of each process are bound to those cores. All MPI processes (as well as the threads bound to the process) behave as we describe in the previous sections. It means the threads spawned from a MPI process might be bound to the same core. To change the default process/thread placmements, please refer to the tables above. 

Summary

The above tables list the most commonly used settings for process/thread placement. Some compilers and Intel libraries may have additional options for process and thread placement beyond those mentioned on this page. For more information on a specific compiler/library, check the more detailed documentation for that library.

GPU Programming

164 Nvidia V100 GPUs are available on Pitzer.  Please visit our GPU documentation.

Reference

Supercomputer: