HOWTO: Estimating and Profiling GPU Memory Usage for Generative AI

Overview

Estimating GPU memory (VRAM) usage for training or running inference with large deep learning models is critical both for requesting the appropriate resources for your computation and for optimizing your job once it is set up.  Out-of-memory (OOM) errors can be avoided by requesting sufficient resources and by better understanding memory usage during the job with the memory profiling tools described here.

 

Estimating GPU Memory Usage for Inference

Estimated GPU VRAM (GB) ≈ 4 × model parameters (in billions)

For example, for StableCode with 3 billion parameters, we estimate 12 GB of VRAM is required to run inference.  A model of this size should fit on a single A100, V100, or H100 for inference.

This estimate comes partially from this reference.

Estimating GPU Memory Usage for Training

Estimated GPU VRAM (GB) ≈ 40 × model parameters (in billions)

For example, for a 7-billion-parameter model such as Llama 2 7B, we estimate a minimum of 280 GB is required for training.  This exceeds the VRAM of even a single H100 accelerator, so distributed training is required.  See HOWTO: PyTorch Fully Sharded Data Parallel (FSDP) for more details.

Of note, the training estimate assumes a transformer-based architecture trained with the Adam optimizer and is extrapolated from results here: Microsoft DeepSpeed.
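
These rules of thumb are easy to apply programmatically. Below is a minimal sketch in Python; the function name and structure are illustrative and not part of any OSC tooling.

# Rule-of-thumb VRAM estimates based on the formulas above.
def estimate_vram_gb(params_billions: float, mode: str = "inference") -> float:
    """Return the estimated GPU memory in GB for a model of the given size.

    inference: ~4 GB per billion parameters
    training:  ~40 GB per billion parameters (transformer + Adam optimizer)
    """
    factor = 4 if mode == "inference" else 40
    return factor * params_billions

# Examples matching the estimates above:
print(estimate_vram_gb(3, "inference"))   # 12 GB for a 3B model
print(estimate_vram_gb(7, "training"))    # 280 GB for a 7B model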


Example GPU Memory Usage for Selected Models

GPU memory usage for selected models
Model Name                 | Parameter Count (billions) | Training / Inference | Batch Size | Min GPUs Required | GPU Memory Usage (GB)
minGPT (GPT-2)             | 0.12                       | training             | 216        | 1 V100 (16GB)     | 9
T5 (small)                 | 3                          | training             | 4          | 1 H100 (94GB)     | 81
T5 (medium)                | 11                         | training             | 4          | 8 H100s (94GB)    | 760
Stable-Code-3b             | 3                          | inference            | 256        | 1 V100 (16GB)     | 14
Falcon-7b-Instruct         | 7                          | inference            | 256        | 1 V100 (32GB)     | 29
CodeLlama-13b-Instruct-hf  | 13                         | inference            | 256        | 1 H100 (94GB)     | 85
Starcoder                  | 15                         | inference            | 256        | 1 H100 (94GB)     | 85

Training memory usage was obtained from Prometheus data.  Inference usage was measured with nvidia-smi and vLLM. While there is some variation, observed usage generally follows the estimates above.


Profiling GPU Memory Usage During Computation

There are a number of tools that can be used to gather more information about your job's GPU memory usage. Detailed memory usage can be helpful in debugging and optimizing your application to reduce memory footprint and increase performance.

GPU Usage script

The get_gpu_usage script is available on all OSC clusters. Start with this script to determine the maximum memory requirements of your job. Once your job has completed, provide the SLURM job ID (and, optionally, the cluster name) to report the maximum memory used on each GPU in your job. For example,

$ get_gpu_usage -M cardinal 477503
Host c0813 GPU #0: 19834 MB
Host c0813 GPU #1: 33392 MB
Host c0813 GPU #2: 28260 MB
Host c0813 GPU #3: 28244 MB
Host c0823 GPU #0: 19808 MB
Host c0823 GPU #1: 33340 MB
Host c0823 GPU #2: 28260 MB
Host c0823 GPU #3: 28244 MB

Nvidia-smi Usage

nvidia-smi is a command-line tool available on all GPU-enabled compute nodes that lists processes and their GPU memory usage. Without any arguments, the output looks like the following:

[username@p0254 ~]$ nvidia-smi
Wed Nov 13 20:58:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           On  |   00000000:3B:00.0 Off |                  Off |
| N/A   27C    P0             37W /  250W |   13830MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     27515      C   .../vllm_env/bin/python                     13818MiB |
+-----------------------------------------------------------------------------------------+

The example output above shows a V100 on a Pitzer compute node hosting a vLLM inference server with a 3-billion-parameter model, using about 14 GB of GPU memory.

Summary statistics appear at the top, listing each available GPU along with its current memory usage and total memory.  Below that, all running processes are shown with their GPU index, PID, process name, and GPU memory usage.

The tool will show multiple GPU devices on the same node if more than one is available, but is limited to one node.

Additional arguments are available, as described in the official documentation.

To run nvidia-smi on the correct node, you will need to ssh to the node where your job is running.  You can find the node hostname using the squeue command:

[username@pitzer-login02 ~]$ squeue -u username
             JOBID PARTITION     NAME     USER   ST       TIME  NODES NODELIST(REASON)
          32521417 gpudebug- interact   username  R       0:38      1 p0254

where "username" is your username.  In the example above, "p0254" is the compute node you need to run the tool on.  The jobid is also useful for other monitoring tools. See HOWTO: Monitoring and Managing Your Job for more details.

Grafana Dashboard Metrics

Grafana provides a dashboard that shows a timeline of GPU memory and utilization over the course of your job. The script job-dashboard-link.py, available on all OSC clusters, generates a link to the dashboard for your job: provide the SLURM job ID to the script, copy the resulting link into your browser, scroll down to "GPU Metrics", and expand it to see the "GPU Memory Usage" panel.

[Figure: Grafana HPC Job Metrics Dashboard, GPU Memory Usage panel]

This can give you an idea of when in your job the memory usage peaked and how long it stayed there.

PyTorch memory snapshotting

This tool requires the following minor modifications to your code:

  • Start: torch.cuda.memory._record_memory_history(max_entries=100000)
  • Save: torch.cuda.memory._dump_snapshot(file_name)
  • Stop: torch.cuda.memory._record_memory_history(enabled=None)

This creates a trace file that can be viewed using the JavaScript viewer available here. The trace records which code paths made each memory allocation and deallocation. This information is sufficient in most cases to understand the memory behavior of your application. The following two tools can provide additional information, but are recommended only for advanced users.

See documentation here for more information on how to snapshot GPU memory usage while running PyTorch code. 
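
As a concrete illustration, the snippet below wraps a small placeholder workload with the three calls listed above. It is a minimal sketch; the file name and the dummy allocations stand in for your own training or inference code.

import torch

# Start: record allocation history (keep up to 100,000 events)
torch.cuda.memory._record_memory_history(max_entries=100000)

# Placeholder workload -- replace with your training or inference loop
x = torch.randn(4096, 4096, device="cuda")
y = x @ x
del x, y

# Save: write the snapshot to a file, then Stop recording
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)

The resulting snapshot file can then be loaded into the viewer referenced above.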

PyTorch Profiler

"PyTorch Profiler is a tool that allows the collection of performance metrics during training and inference. Profiler’s context manager API can be used to better understand what model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity and visualize the execution trace."

The PyTorch profiler also requires code modifications. It provides a suite of configuration options for what information to track and how to export it. The overhead (both the slowdown of your job and the size of the profile files) can become very large. There are multiple ways to view the profile data (TensorBoard, HTA, a Chrome-based browser, etc.). At the time of writing (2/18/25), TensorBoard support has been officially deprecated, while HTA is still experimental.

See PyTorch Profiler documentation here.

Here is an example walkthrough using both tools.
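
For reference, the following is a minimal sketch of the profiler's context-manager API with memory tracking enabled; the placeholder workload, sort key, and output file name are illustrative choices rather than requirements.

import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,    # track tensor memory allocation/deallocation
    record_shapes=True,     # record operator input shapes
) as prof:
    # Placeholder workload -- replace with your training or inference step
    x = torch.randn(2048, 2048, device="cuda")
    y = torch.nn.functional.relu(x @ x)

# Summarize operators by the GPU memory they allocated
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

# Optionally export a trace that can be opened in a Chrome-based browser
prof.export_chrome_trace("profiler_trace.json")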

NVIDIA Nsight Systems

This profiler provides detailed hardware-level information about what the GPU did during your job. It can be challenging to map hardware events to user-level functions when using Nsight, particularly for Python-based codes. This is only recommended for advanced users. Documentation from NVIDIA on how to use Nsight Systems is available here.

Solving GPU Out-of-Memory Errors

While there is no one-size-fits-all solution to solving OOM errors, here are a few common strategies to try.  If you require assistance, please contact OSC Support.

  • Request more GPU resources
  • Use smaller models – fewer parameters and/or lower precision (quantization)
  • Use flash-attention (not available on V100s)
  • Reduce batch size
  • Use Fully Sharded Data Parallel (FSDP) in PyTorch to distribute large models across multiple GPU devices
  • Use memory profiling tools to identify bottlenecks
  • For inference, disable gradient computation and put the model in evaluation mode (see the sketch after this list)
    • torch.no_grad()
    • model.eval()
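
As a minimal sketch of the last point (the model and input batch below are placeholders for your own):

import torch
import torch.nn as nn

# Placeholder model and input batch -- substitute your own
model = nn.Linear(1024, 1024).cuda()
inputs = torch.randn(64, 1024, device="cuda")

model.eval()               # evaluation mode: disables dropout and similar layers
with torch.no_grad():      # no gradient bookkeeping, so extra activations are not kept
    outputs = model(inputs)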

Estimate Your GPU Memory Usage for Training