Overview
Estimating GPU memory (VRAM) usage for training or running inference with large deep learning models is critical both for (1) requesting the appropriate resources for your computation and (2) optimizing your job once it is set up. Out-of-memory (OOM) errors can be avoided by requesting appropriate resources and by better understanding memory usage during the job with the memory profiling tools described here.
Estimating GPU Memory Usage for Inference
Estimated GPU VRAM in GB = 4x model parameters (in billions)
For example, for StableCode with 3 billion parameters, we estimate 12 GB are required to run inference. A model like this should fit on an A100, V100, or H100 for inference.
This estimate comes partially from this reference.
Estimating GPU Memory Usage for Training
Estimated GPU VRAM in GB = 40x model parameters (in billions)
For example, for a 7-billion-parameter model such as Llama-2-7b, we estimate a minimum of 280 GB is required to train it. This exceeds the VRAM of even a single H100 accelerator, requiring distributed training. See HOWTO: PyTorch Fully Sharded Data Parallel (FSDP) for more details.
Of note, the training estimate assumes a transformer-based architecture with the Adam optimizer and is extrapolated from the results here: Microsoft DeepSpeed.
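As a quick sketch of the two rules of thumb above, the short Python example below computes both estimates; the function name and the example models are illustrative only, and actual usage will vary with batch size, sequence length, and precision.

```python
def estimate_vram_gb(params_billions: float, training: bool = False) -> float:
    """Rule-of-thumb VRAM estimate in GB.

    Inference: ~4 GB per billion parameters.
    Training:  ~40 GB per billion parameters (transformer + Adam optimizer).
    """
    return params_billions * (40 if training else 4)

# Illustrative examples:
print(estimate_vram_gb(3))                 # ~12 GB to run inference on a 3B model
print(estimate_vram_gb(7, training=True))  # ~280 GB to train a 7B model
```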
Example GPU Memory Usage for Selected Models
Model Name | Parameter count (billions) | Training / Inference | Batch Size | Min GPUs required | GPU Memory Usage (GB) |
---|---|---|---|---|---|
minGPT (GPT-2) | 0.12 | training | 216 | 1 V100 (16GB) | 9 |
T5 (small) | 3 | training | 4 | 1 H100 (94GB) | 81 |
T5 (medium) | 11 | training | 4 | 8 H100s (94GB) | 760 |
Stable-Code-3b | 3 | inference | 256 | 1 V100 (16GB) | 14 |
Falcon-7b-Instruct | 7 | inference | 256 | 1 V100 (32GB) | 29 |
CodeLlama-13b-Instruct-hf | 13 | inference | 256 | 1 H100 (94GB) | 85 |
Starcoder | 15 | inference | 256 | 1 H100 (94GB) | 85 |
Training memory usage was obtained from Prometheus data. Inference usage was measured with nvidia-smi and vllm. While there is some variation, observed usage generally follows the estimates above.
Profiling GPU Memory Usage During Computation
There are a number of tools that can be used to gather more information about your job's GPU memory usage. Detailed memory usage can be helpful in debugging and optimizing your application to reduce memory footprint and increase performance.
GPU Usage script
The `get_gpu_usage` script is available on all OSC clusters. Start with this script to determine the maximum memory requirements of your job. Once your job has completed, provide the SLURM job ID (and, optionally, the cluster name) to get the maximum memory usage on each GPU used by your job. For example,
$ get_gpu_usage -M cardinal 477503
Host c0813 GPU #0: 19834 MB
Host c0813 GPU #1: 33392 MB
Host c0813 GPU #2: 28260 MB
Host c0813 GPU #3: 28244 MB
Host c0823 GPU #0: 19808 MB
Host c0823 GPU #1: 33340 MB
Host c0823 GPU #2: 28260 MB
Host c0823 GPU #3: 28244 MB
Nvidia-smi Usage
`nvidia-smi` is a command-line tool available on all GPU-enabled compute nodes that lists processes and their GPU memory usage. Without any arguments, the output looks like the following:
[username@p0254 ~]$ nvidia-smi
Wed Nov 13 20:58:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           On  |   00000000:3B:00.0 Off |                  Off |
| N/A   27C    P0             37W /  250W |   13830MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      27515      C   .../vllm_env/bin/python                    13818MiB |
+-----------------------------------------------------------------------------------------+
The example output above shows a V100 on a Pitzer compute node hosting a vllm inference server with a 3-billion-parameter model, using about 14 GB of GPU memory.
Summary statistics appear at the top, showing the available GPUs along with their current and total memory. Below that, all running processes are listed with the relevant GPU, PID, process name, and GPU memory usage for each process.
The tool will show multiple GPU devices on the same node if more than one is available, but is limited to one node.
Additional arguments are available, as described in the official documentation.
To run `nvidia-smi` on the correct node, you will need to `ssh` to the node where your job is running. You can find the node hostname using the `squeue` command:
[username@pitzer-login02 ~]$ squeue -u username
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          32521417 gpudebug- interact username  R       0:38      1 p0254
where "username" is your username. In the example above, "p0254" is the compute node you need to run the tool on. The jobid is also useful for other monitoring tools. See HOWTO: Monitoring and Managing Your Job for more details.
Grafana Dashboard Metrics
Grafana provides a dashboard that shows a timeline of GPU memory usage and utilization over the course of your job. The `job-dashboard-link.py` script, available on all OSC clusters, generates a link that can be used to view the dashboard for your job. Provide the SLURM job ID to the script, copy the generated link into your browser, scroll down to "GPU Metrics", and expand it to see the "GPU Memory Usage" panel.

This can give you an idea of when in your job the memory usage peaked and how long it stayed there.
PyTorch memory snapshotting
This tool requires the following minor modifications to your code:
- Start: `torch.cuda.memory._record_memory_history(max_entries=100000)`
- Save: `torch.cuda.memory._dump_snapshot(file_name)`
- Stop: `torch.cuda.memory._record_memory_history(enabled=None)`
This creates a trace file that can be viewed using the JavaScript tool available here. The trace records which code performed each memory allocation and deallocation, which in most cases is sufficient to understand the memory behavior of your application. The following two tools can provide additional information, but are only recommended for advanced users.
See documentation here for more information on how to snapshot GPU memory usage while running PyTorch code.
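As a minimal sketch of how these three calls fit together, the example below wraps a placeholder workload; the model, tensor sizes, and output file name are illustrative assumptions, not part of OSC's documentation.

```python
import torch

# Begin recording allocation/deallocation events (with stack traces)
torch.cuda.memory._record_memory_history(max_entries=100000)

# Placeholder workload: substitute the model and data you are debugging
model = torch.nn.Linear(4096, 4096).cuda()
out = model(torch.randn(256, 4096, device="cuda"))
out.sum().backward()

# Write the snapshot to a pickle file for later visualization
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

# Stop recording
torch.cuda.memory._record_memory_history(enabled=None)
```

The resulting snapshot file can then be loaded into the viewer linked above.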
PyTorch Profiler
"PyTorch Profiler is a tool that allows the collection of performance metrics during training and inference. Profiler’s context manager API can be used to better understand what model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity and visualize the execution trace."
The PyTorch Profiler also requires code modifications. It provides a suite of configuration options controlling what information to track and how to export it. The overhead, both in terms of slowing down your job and the size of the profile files, can become very large. There are multiple ways to view the profile data (TensorBoard, HTA, Chrome trace viewer, etc.). At the time of writing (2/18/25), TensorBoard support has been officially deprecated, while HTA is still experimental.
See PyTorch Profiler documentation here.
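As a minimal sketch of the profiler's context-manager API with memory tracking enabled; the workload, sort key, and output file name below are illustrative assumptions.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()   # placeholder workload
inputs = torch.randn(256, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,   # track tensor memory allocation/deallocation
    record_shapes=True,    # record operator input shapes
) as prof:
    model(inputs).sum().backward()

# Print operators sorted by the CUDA memory they allocated themselves
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

# Optionally export a trace viewable in a Chrome-based browser
prof.export_chrome_trace("profiler_trace.json")
```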
Here is an example walkthrough using both tools.
NVIDIA Nsight Systems
This profiler provides detailed hardware-level information about what the GPU did during your job. It can be challenging to map hardware events to user-level functions when using Nsight, particularly for Python-based codes, so it is only recommended for advanced users. Documentation from NVIDIA on how to use Nsight Systems is available here.
Solving GPU Out-of-Memory Errors
While there is no one-size-fits-all solution to solving OOM errors, here are a few common strategies to try. If you require assistance, please contact OSC Support.
- Request more GPU resources
- Use smaller models – fewer parameters and/or lower precision (quantization)
- Use flash-attention (not available on V100s)
- Reduce batch size
- Use Fully Sharded Data Parallel (FSDP) in PyTorch to distribute large models across multiple GPU devices
- Use memory profiling tools to identify bottlenecks
- For inference, ensure gradient computations are disabled (see the sketch after this list):
  - `torch.no_grad()`
  - `model.eval()`
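As a minimal sketch of the inference-mode items above (the model and input sizes are placeholders):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder model
model.eval()                   # switch layers such as dropout/batchnorm to eval mode

with torch.no_grad():          # skip building the autograd graph, saving activation memory
    inputs = torch.randn(64, 4096, device="cuda")
    outputs = model(inputs)
```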