Overview
Estimating GPU memory (VRAM) usage for training or running inference with large deep learning models is critical both to requesting the appropriate resources for your computation and to optimizing your job once it is set up. Out-of-memory (OOM) errors can be avoided by requesting appropriate resources and by better understanding memory usage during the job with the memory profiling tools described here.
Estimating GPU Memory Usage for Inference
Estimated GPU VRAM in GB = 2-4x model parameters (in billions)
For example, for GPT2 with 1.5 billion parameters, we estimate 3-6GB are required to run inference. A model like this should comfortably fit on an A100, V100, or H100 for inference.
This estimate comes partially from this reference.
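As a quick illustration of this rule of thumb, the Python sketch below computes the range for a given parameter count; the factor of 2 roughly corresponds to 2 bytes per parameter for half-precision weights, with the higher end leaving headroom for activations and runtime overhead.
# Rough rule-of-thumb estimate; the parameter count is the only input.
def estimate_inference_vram_gb(params_billion):
    """Return a (low, high) inference VRAM estimate in GB using the 2-4x rule."""
    return 2 * params_billion, 4 * params_billion

low, high = estimate_inference_vram_gb(1.5)  # GPT2, ~1.5 billion parameters
print(f"Estimated inference VRAM: {low:.0f}-{high:.0f} GB")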
Estimating GPU Memory Usage for Training
Estimated GPU VRAM in GB = 40x model parameters (in billions)
For example, for a 7-billion-parameter model such as LLaMA 2 7B, we estimate a minimum of 280GB is required for training. This exceeds the VRAM of even a single H100 accelerator, requiring distributed training. See HOWTO: PyTorch Fully Sharded Data Parallel (FSDP) for more details.
Of note, the training estimate assumes a transformer-based architecture with the Adam optimizer and is extrapolated from results published by Microsoft DeepSpeed.
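As a quick illustration, the sketch below applies the 40x rule; the per-parameter breakdown in the comment follows the DeepSpeed/ZeRO analysis for mixed-precision Adam and is included only as a rough guide.
# Rough rule-of-thumb estimate for training a transformer with Adam.
def estimate_training_vram_gb(params_billion):
    """Return a minimum training VRAM estimate in GB using the 40x rule."""
    # Roughly 16 bytes per parameter cover half-precision weights and gradients
    # plus fp32 optimizer states; the rest of the 40x factor is headroom for
    # activations and temporary buffers.
    return 40 * params_billion

print(f"Estimated training VRAM: {estimate_training_vram_gb(7):.0f} GB")  # 7B parameters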
Profiling GPU Memory Usage During Computation
Once your job is running, a more accurate way of measuring GPU memory usage is through PyTorch memory snapshotting and profiling, the NVIDIA nvidia-smi tool, or the Grafana monitoring tool. Detailed memory usage information can be helpful in debugging and optimizing your application to reduce its memory footprint and increase performance.
PyTorch memory snapshotting
- Start: torch.cuda.memory._record_memory_history(max_entries=100000)
- Save: torch.cuda.memory._dump_snapshot(file_name)
- Stop: torch.cuda.memory._record_memory_history(enabled=None)
See documentation here on how to snapshot GPU memory usage while running PyTorch code.
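Here is a minimal sketch of how these calls fit together in a script, assuming a CUDA-enabled PyTorch installation; the model, workload, and file name below are placeholders.
import torch

# Begin recording allocator events before the workload you want to inspect.
torch.cuda.memory._record_memory_history(max_entries=100000)

# Placeholder workload: replace with your own model, forward, and backward passes.
model = torch.nn.Linear(4096, 4096).cuda()
out = model(torch.randn(64, 4096, device="cuda"))
out.sum().backward()

# Write the snapshot to disk; it can be inspected at https://pytorch.org/memory_viz.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

# Stop recording.
torch.cuda.memory._record_memory_history(enabled=None)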
PyTorch Profiler
"PyTorch Profiler is a tool that allows the collection of performance metrics during training and inference. Profiler’s context manager API can be used to better understand what model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity and visualize the execution trace."
See PyTorch Profiler documentation here.
Here is an example walkthrough using both tools.
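As a minimal sketch of the profiler's context manager API with memory profiling enabled (the model and input shapes below are placeholders):
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input; replace with your own.
model = torch.nn.Linear(4096, 4096).cuda()
inputs = torch.randn(64, 4096, device="cuda")

# profile_memory=True records per-operator tensor memory allocations.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True, record_shapes=True) as prof:
    model(inputs)

# Summarize operators sorted by CUDA memory usage.
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))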
Nvidia-smi Usage
nvidia-smi is a command-line tool available on all GPU-enabled compute nodes that lists processes and their GPU memory usage. Without any arguments, the output looks like the following:
[username@p0254 ~]$ nvidia-smi
Wed Nov 13 20:58:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           On  |   00000000:3B:00.0 Off |                  Off |
| N/A   27C    P0             37W / 250W  |  13830MiB /  16384MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     27515      C   .../vllm_env/bin/python                     13818MiB |
+-----------------------------------------------------------------------------------------+
The example output above shows a V100 on a Pitzer compute node running a vLLM inference server that is serving a 3-billion-parameter model and using about 14GB of GPU memory.
Summary statistics appear at the top, showing the available GPUs along with their current and maximum memory. Below that, each running process is listed with its GPU index, PID, process name, and GPU memory usage.
The tool will show multiple GPU devices on the same node if more than one is available, but is limited to one node.
Additional arguments are available, as described in the official documentation.
To run nvidia-smi on the correct node, you will need to ssh to the node where your job is running. You can find the node hostname using the squeue command:
[username@pitzer-login02 ~]$ squeue -u username
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          32521417 gpudebug- interact username  R       0:38      1 p0254
where "username" is your username. In the example above, "p0254" is the compute node you need to run the tool on. The jobid is also useful for other monitoring tools. See HOWTO: Monitoring and Managing Your Job for more details.
Using Grafana Dashboard Metrics
To view job-specific resource usage:
- Navigate to the OSC grafana page.
- Select Dashboards -> Public -> HPC Job Metrics.
- Enter your cluster and jobid (available from squeue - see details above).
- Scroll down to "GPU Metrics", and expand to see the "GPU Memory Usage" panel.
Solving GPU Out-of-Memory Errors
While there is no one-size-fits-all solution to solving OOM errors, here are a few common strategies to try. If you require assistance, please contact OSC Support.
- Request more GPU resources
- Use smaller models – fewer parameters and/or lower precision (quantization)
- Use flash-attention (not available on V100s)
- Reduce batch size
- Use Fully Sharded Data Parallel (FSDP) in PyTorch to distribute large models across multiple GPU devices
- Use memory profiling tools to identify bottlenecks
- For inference, ensure gradient computations are disabled (see the sketch after this list):
  - torch.no_grad()
  - model.eval()
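A minimal sketch of running inference with gradients disabled (the model and input below are placeholders):
import torch

model = torch.nn.Linear(4096, 4096).cuda()
model.eval()  # put layers such as dropout and batchnorm into inference mode

inputs = torch.randn(64, 4096, device="cuda")

# Disabling gradient tracking avoids storing activations for backpropagation,
# which substantially reduces memory use during inference.
with torch.no_grad():
    outputs = model(inputs)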