HOWTO: Estimating and Profiling GPU Memory Usage for Generative AI

Overview

Estimating GPU memory (VRAM) usage for training or running inference with large deep learning models is critical both to requesting the appropriate resources for your computation and to optimizing your job once it is set up.  Out-of-memory (OOM) errors can be avoided by requesting appropriate resources and by understanding memory usage during the job with the memory profiling tools described here.

 

Estimating GPU Memory Usage for Inference

Estimated GPU VRAM in GB = 2-4x model parameters (in billions)

For example, for GPT-2 with 1.5 billion parameters, we estimate 3-6 GB are required to run inference.  A model of this size should comfortably fit on a single A100, V100, or H100 for inference.  The 2-4x range reflects parameter precision: roughly 2 bytes per parameter for FP16/BF16 weights and 4 bytes per parameter for FP32, plus overhead for activations and the KV cache.

This estimate comes partially from this reference.

Estimating GPU Memory Usage for Training

Estimated GPU VRAM in GB = 40x model parameters (in billions)

For example, for a 7-billion-parameter model such as LLaMA 2 7B, we estimate a minimum of 280 GB to train it.  This exceeds the VRAM of even a single H100 accelerator, requiring distributed training.  See HOWTO: PyTorch Fully Sharded Data Parallel (FSDP) for more details.

Of note, the training estimate assumes a transformer-based architecture trained with the Adam optimizer and is extrapolated from results here: Microsoft DeepSpeed.  Mixed-precision training with Adam uses roughly 16 bytes per parameter for the weights, gradients, and optimizer states alone; the remainder of the estimate covers activations and temporary buffers.
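
As a quick check of these rules of thumb, both formulas above can be wrapped in a small Python helper.  This is a minimal sketch; the function names are our own, and the multipliers remain heuristics that ignore batch size, sequence length, and framework overhead.

def estimate_inference_vram_gb(params_billions):
    """Return (low, high) VRAM estimate in GB for inference (2-4x rule)."""
    return 2 * params_billions, 4 * params_billions

def estimate_training_vram_gb(params_billions):
    """Return approximate VRAM in GB for training with Adam (40x rule)."""
    return 40 * params_billions

low, high = estimate_inference_vram_gb(1.5)   # GPT-2, 1.5 billion parameters
print(f"GPT-2 inference:   ~{low:.0f}-{high:.0f} GB")                 # ~3-6 GB
print(f"7B model training: ~{estimate_training_vram_gb(7):.0f} GB")   # ~280 GB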


Profiling GPU Memory Usage During Computation

Once your job is running, you can measure GPU memory more accurately through PyTorch memory snapshotting and profiling, the NVIDIA nvidia-smi tool, or the Grafana monitoring dashboard.  Detailed memory usage information is helpful for debugging and for optimizing your application to reduce its memory footprint and increase performance.

PyTorch memory snapshotting

  • Start: torch.cuda.memory._record_memory_history(max_entries=100000)
  • Save: torch.cuda.memory._dump_snapshot(file_name)
  • Stop: torch.cuda.memory._record_memory_history(enabled=None)

See documentation here on how to snapshot GPU memory usage while running PyTorch code.
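
Putting the three calls together, a minimal sketch might look like the following.  The file name and the placeholder workload are only illustrative; wrap your own training or inference steps instead.  The resulting pickle file can be inspected with the PyTorch memory visualizer (pytorch.org/memory_viz).

import torch

# Start recording allocator history (keeps up to 100,000 events).
torch.cuda.memory._record_memory_history(max_entries=100000)

try:
    # Placeholder workload -- replace with your training or inference steps.
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
finally:
    # Save the snapshot, then stop recording.
    torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)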

PyTorch Profiler

"PyTorch Profiler is a tool that allows the collection of performance metrics during training and inference. Profiler’s context manager API can be used to better understand what model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity and visualize the execution trace."

See PyTorch Profiler documentation here.

Here is an example walkthrough using both tools.
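
As a brief illustration, below is a minimal sketch of the profiler's context manager with memory tracking enabled; the model and input are placeholders for your own workload.

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()      # placeholder model
inputs = torch.randn(64, 1024, device="cuda")   # placeholder input

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,   # track tensor allocations and releases
    record_shapes=True,    # record operator input shapes
) as prof:
    with torch.no_grad():
        model(inputs)

# Rank operators by the CUDA memory they allocated.
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))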

Nvidia-smi Usage

nvidia-smi is a command-line tool available on all GPU-enabled compute nodes that lists processes and their GPU memory usage. Without any arguments, the output looks like the following:

[username@p0254 ~]$ nvidia-smi
Wed Nov 13 20:58:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           On  |   00000000:3B:00.0 Off |                  Off |
| N/A   27C    P0             37W /  250W |   13830MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     27515      C   .../vllm_env/bin/python                     13818MiB |
+-----------------------------------------------------------------------------------------+

The example output above shows a V100 on a Pitzer compute node running a vLLM inference server that is serving a 3-billion-parameter model and using about 14 GB of GPU memory.

Summary statistics appear at the top, showing each available GPU along with its current and total memory.  Below that, all running processes are listed with the relevant GPU, PID, process name, and GPU memory usage.

The tool will show multiple GPU devices if more than one is available on the node, but it only reports on the node where it is run.

Additional arguments are available, as described in the official documentation.

To run nvidia-smi on the correct node, you will need to ssh to the node where your job is running.  You can find the node hostname using the squeue command:

[username@pitzer-login02 ~]$ squeue -u username
             JOBID PARTITION     NAME     USER   ST       TIME  NODES NODELIST(REASON)
          32521417 gpudebug- interact   username  R       0:38      1 p0254

where "username" is your username.  In the example above, "p0254" is the compute node you need to run the tool on.  The jobid is also useful for other monitoring tools. See HOWTO: Monitoring and Managing Your Job for more details.

Using Grafana Dashboard Metrics

To view job-specific resource usage:

  1. Navigate to the OSC Grafana page.
  2. Select Dashboards -> Public -> HPC Job Metrics.
  3. Enter your cluster and job ID (available from squeue - see details above).
  4. Scroll down to "GPU Metrics" and expand it to see the "GPU Memory Usage" panel.
[Figure: Grafana HPC Job Metrics Dashboard, GPU Memory Usage panel]

Solving GPU Out-of-Memory Errors

While there is no one-size-fits-all solution to OOM errors, here are a few common strategies to try.  If you require assistance, please contact OSC Support.

  • Request more GPU resources
  • Use smaller models – fewer parameters and/or lower precision (quantization)
  • Use flash-attention (not available on V100s)
  • Reduce batch size
  • Use Fully Sharded Data Parallel (FSDP) in PyTorch to distribute large models across multiple GPU devices
  • Use memory profiling tools to identify bottlenecks
  • For inference, ensure gradient computations are disabled (see the sketch below)
    • torch.no_grad()
    • model.eval()
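
For the last item, a minimal inference sketch with gradients disabled might look like the following; the model and input are placeholders for your own.

import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
model.eval()  # put layers such as dropout and batch norm into evaluation mode

inputs = torch.randn(32, 1024, device="cuda")

with torch.no_grad():  # skip building the autograd graph, saving activation memory
    outputs = model(inputs)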