GPU Computing

OSC offers GPU computing on all its systems. While GPUs can provide a significant boost in performance for some applications, the computing model is very different from the CPU. This page will discuss some of the ways you can use GPU computing at OSC.

Accessing GPU Resources

To request nodes with GPUs, add the --gpus-per-node=x option to the #SBATCH directives in your batch script, where x is the number of GPUs you need on each node. For example, on Pitzer:

#SBATCH --gpus-per-node=1

In most cases you'll need to load the cuda module (module load cuda) to make the necessary NVIDIA libraries available.
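As a quick sanity check inside a job, the sketch below counts the GPUs the job can see. It assumes that the scheduler exports CUDA_VISIBLE_DEVICES as a comma-separated list of device indices, which is common Slurm behavior but depends on site configuration. The function name count_visible_gpus is hypothetical.

```shell
#!/bin/bash
# Count the GPUs visible to this process by parsing CUDA_VISIBLE_DEVICES.
# Assumption: inside a GPU job the scheduler sets this variable to a
# comma-separated list such as "0,1". An unset or empty value is reported as 0.
count_visible_gpus() {
    devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return 0
    fi
    # One line per device index, then count the lines.
    echo "$devices" | tr ',' '\n' | wc -l | tr -d ' '
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

If the count does not match the value passed to --gpus-per-node, check the job's directives before debugging the application itself.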

Setting the GPU compute mode (optional)

The GPUs on any cluster can be set to different compute modes. Select a mode by adding the --gpu_cmode option to the GPU specification when using the srun command:

srun --gpu_cmode=exclusive

srun --gpu_cmode=shared

If no compute mode is specified, GPU nodes default to shared. In this mode, multiple CUDA processes are allowed on the same GPU device.
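To confirm which mode a job actually received, you can query the driver from inside the job. The sketch below uses the compute_mode query field of nvidia-smi. It is an assumption that the shared mode is reported as "Default" and the exclusive mode as "Exclusive_Process", since those are the driver's own names rather than the names used above. The function name show_gpu_compute_mode is hypothetical.

```shell
#!/bin/bash
# Print index, name, and compute mode for each GPU visible to the job.
# Falls back to a message when nvidia-smi is missing or cannot reach a GPU,
# for example on a login node.
show_gpu_compute_mode() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index,name,compute_mode --format=csv,noheader \
            || echo "nvidia-smi could not query the GPUs"
    else
        echo "nvidia-smi not found: run this inside a GPU job"
    fi
}

show_gpu_compute_mode
```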

Example GPU Jobs

Single-node/Multi-GPU Job Script

#!/bin/bash
#SBATCH --account <Project-ID>
#SBATCH --job-name Pytorch_Example
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --gpus-per-node=4

ml miniconda3/4.10.3-p37 cuda/11.8.0

source activate pytorch

python example.py

Multi-node/Multi-GPU Job Script

#!/bin/bash
#SBATCH --account <Project-ID>
#SBATCH --job-name Pytorch_Example
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --gpus-per-node=4

ml miniconda3/4.10.3-p37 cuda/11.8.0

source activate pytorch

python example.py
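Note that a bare python command in a batch script runs only on the first node of the allocation, so a multi-node job usually needs a launcher. The sketch below builds, but does not execute, one possible launch line that uses srun to start a single torchrun process on each node. It assumes that torchrun is available in the pytorch environment and that example.py is written for distributed training. The function name build_launch_cmd, the host name node01, and port 29500 are hypothetical placeholders.

```shell
#!/bin/bash
# Build a launch command that starts one torchrun per node.
# Arguments: number of nodes, GPUs per node, rendezvous host, training script.
build_launch_cmd() {
    nnodes="$1"
    gpus_per_node="$2"
    head_host="$3"
    script="$4"
    echo "srun --ntasks-per-node=1 torchrun --nnodes=$nnodes --nproc_per_node=$gpus_per_node --rdzv_backend=c10d --rdzv_endpoint=$head_host:29500 $script"
}

# Inside a real job the first host could be looked up with:
#   scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1
# Here placeholder values are used so the sketch runs anywhere.
build_launch_cmd "${SLURM_JOB_NUM_NODES:-2}" 4 "node01" example.py
```

Printing the command first makes it easy to inspect the node and GPU counts before committing to a long run.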

If you are using the Nsight GPU profiler, you may experience an error such as the following:

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.

This happens because a GPU monitoring service (DCGM) runs on the nodes by default. You can disable it and use Nsight by adding the Slurm option --gres=nsight to your job.

Running Multiple GPU Tasks in the Same Job

If your job has low GPU utilization, consider running multiple GPU tasks within the same job using the --overlap option, as demonstrated in the sample script below.

#!/bin/bash
#SBATCH --job-name=shared-gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=1
#SBATCH --gpu_cmode=shared
#SBATCH --time=1:00:00

# Run 4 tasks concurrently on one shared GPU
srun --overlap --gpus=1 -n 1 ./my-gpu-task1 &
srun --overlap --gpus=1 -n 1 ./my-gpu-task2 &
srun --overlap --gpus=1 -n 1 ./my-gpu-task3 &
srun --overlap --gpus=1 -n 1 ./my-gpu-task4 &
wait
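A bare wait returns success even if one of the background tasks failed, so a failed task can go unnoticed. The sketch below is one way to track each task and report failures. The function name run_all is hypothetical, and the commands true and false stand in for the srun lines above so the sketch can be run anywhere.

```shell
#!/bin/bash
# Launch every command in the background, wait on each PID individually,
# and return the number of commands that exited with a nonzero status.
run_all() {
    pids=""
    for cmd in "$@"; do
        # In a real job each entry would be a full srun line, e.g.
        #   "srun --overlap --gpus=1 -n 1 ./my-gpu-task1"
        sh -c "$cmd" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        if ! wait "$pid"; then
            echo "task with PID $pid failed" >&2
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}

# Demo with stand-in commands that both succeed.
run_all "true" "true" && echo "all tasks finished successfully"
```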

Using GPU-enabled Applications

We have several supported applications that can use GPUs. These include:

Machine learning / Neural networks
- Caffe
- TensorFlow
- Torch
Molecular mechanics / dynamics
- Amber
- Gromacs
- LAMMPS
- NAMD (limited availability)
General mathematics
- MATLAB

Engineering and Quantum Chemistry applications are expected to follow.

Please see the software pages for each application. They have different levels of support for multi-node jobs, CPU/GPU work sharing, and environment setup.

Libraries with GPU Support

There are a few libraries that provide GPU implementations of commonly used routines. While they mostly hide the details of using a GPU, there are still some GPU specifics you'll need to be aware of, e.g., device initialization, threading, and memory allocation. The following libraries are available at OSC:

MAGMA

MAGMA is an implementation of BLAS and LAPACK with multi-core (SMP) and GPU support. Its API differs in some respects from that of standard BLAS and LAPACK.

cuBLAS and cuSPARSE

cuBLAS is a highly optimized BLAS from NVIDIA. There are a few versions of this library, from very GPU-specific to nearly transparent. cuSPARSE is a BLAS-like library for sparse matrices.

The MAGMA library is built on cuBLAS.

cuFFT

cuFFT is NVIDIA's Fourier transform library with an API similar to FFTW.

cuDNN

cuDNN is NVIDIA's Deep Neural Network machine learning library. Many ML applications are built on cuDNN.

Direct GPU Programming

GPUs present a different programming model from CPUs, so going this route requires a significant time investment.

OpenACC

OpenACC is a directives-based model similar to OpenMP. Currently this is supported only by the Portland Group (PGI) C/C++ and Fortran compilers, which are now distributed as part of the NVIDIA HPC SDK.

OpenCL

OpenCL is a set of libraries and C/C++ compiler extensions supporting GPUs (NVIDIA and AMD) and other hardware accelerators. The CUDA module provides an OpenCL library.

CUDA

CUDA is the standard NVIDIA development environment. In this model explicit GPU code is written in the CUDA C/C++ dialect, compiled with the CUDA compiler NVCC, and linked with a native driver program.
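The write, compile, and run cycle described above can be sketched as follows. The script writes a small vector-addition program to a temporary directory and compiles it with nvcc if the compiler is on the PATH. The file name vec_add.cu and the kernel are illustrative only, and the program uses unified memory purely to keep the host code short.

```shell
#!/bin/bash
# Write a tiny CUDA source file, then compile and run it when nvcc is available.
workdir=$(mktemp -d)
cat > "$workdir/vec_add.cu" <<'EOF'
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory is accessible from both host and device.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();  // wait for the kernel before reading c
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
EOF

if command -v nvcc >/dev/null 2>&1; then
    # Compile with NVCC, then run the resulting binary on the GPU.
    { nvcc -o "$workdir/vec_add" "$workdir/vec_add.cu" && "$workdir/vec_add"; } \
        || echo "compile or run failed: check that a GPU is allocated"
else
    echo "nvcc not found: load the cuda module first (module load cuda)"
fi
```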

About GPU Hardware

Our GPUs span several generations with different capabilities and ease of use. Many of the differences won't be visible when using applications or libraries, but some features and applications may not be supported on the older models.

Pitzer V100

The NVIDIA V100 "Volta" GPU, with a compute capability of 7.0, offers several advanced features, one of which is its Tensor Cores. These Tensor Cores empower the GPU to perform mixed-precision matrix operations, significantly enhancing its efficiency for deep learning workloads and expediting tasks such as AI model training and inference.

The V100s deployed in 2018 come equipped with 16GB of memory, whereas the V100s deployed in 2020 feature 32GB of memory. There are two GPUs per GPU node.

Additionally, there are four large memory nodes equipped with quad NVIDIA Volta V100s with 32GB of GPU memory and NVLink.

Ascend A100

The NVIDIA A100 "Ampere" GPU, with a compute capability of 8.0, empowers advanced deep learning and scientific computing tasks. For instance, it accelerates and enhances the training of deep neural networks, enabling the training of intricate models like GPT-4 in significantly less time when compared to earlier GPU architectures.

On a quad-GPU node, each A100 comes equipped with 80GB of memory. There are 4 GPUs connected via NVLink, offering a total of 320GB of usable GPU memory per node.

On a dual-GPU node, the A100 comes equipped with 40GB of memory per GPU. There are 2 GPUs, providing a total of 80GB of usable GPU memory per node.

Cardinal H100 NVL

The NVIDIA H100 "Hopper" GPU, with a compute capability of 9.0, empowers advanced deep learning and scientific computing tasks. For instance, it accelerates and enhances the training of deep neural networks, enabling the training of intricate models like GPT-4 in significantly less time when compared to earlier GPU architectures.

Each H100 comes equipped with 94GB of HBM2e memory. There are 4 GPUs with NVLink, offering 376GB of usable GPU memory per node.
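When compiling your own CUDA code, the compute capabilities listed above determine which architecture to target. The sketch below maps each cluster to the matching nvcc -arch flag. The function name gpu_arch_flag and the lowercase cluster keys are hypothetical. The flags follow directly from the compute capabilities 7.0, 8.0, and 9.0 given in this section.

```shell
#!/bin/bash
# Map a cluster name to the nvcc architecture flag for its GPUs.
gpu_arch_flag() {
    case "$1" in
        pitzer)   echo "-arch=sm_70" ;;  # V100, compute capability 7.0
        ascend)   echo "-arch=sm_80" ;;  # A100, compute capability 8.0
        cardinal) echo "-arch=sm_90" ;;  # H100, compute capability 9.0
        *)
            echo "unknown cluster: $1" >&2
            return 1
            ;;
    esac
}

# Example: print the compile line you might use on Ascend.
echo "nvcc $(gpu_arch_flag ascend) -o my-gpu-task my-gpu-task.cu"
```

Code built for an older architecture generally still runs on newer GPUs, but targeting the actual hardware lets the compiler use newer features.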
