Category:
Resolution:
Unresolved
We have noticed that some GPU jobs do not properly release GPU memory when they finish, causing subsequent jobs scheduled on the same nodes to fail with out-of-memory (OOM) errors. We are currently working on a resolution.
If you encounter an OOM error in a GPU job, you can use the job-dashboard-link.py script to generate a Grafana dashboard link for your job's resource usage. For example, for job 2502244 on the Cardinal cluster:
job-dashboard-link.py -M cardinal 2502244
Open the generated URL and navigate to GPU Metrics, then check the GPU Memory Usage panel.
If memory usage is already high at the very start of your job, before your application has allocated anything, your job may be affected by this GPU memory release issue.
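As a quick check from inside the job itself, you can also query the GPUs directly with nvidia-smi before your application starts. The sketch below is a hypothetical helper, not part of our tooling: it assumes nvidia-smi is on your PATH and uses an arbitrary 100 MiB threshold to flag GPUs that already hold memory at job start.

```python
import subprocess

def leftover_gpus(csv_text, threshold_mib=100):
    """Return indices of GPUs whose used memory exceeds threshold_mib.

    csv_text is the output of:
      nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    (one integer MiB value per line, one line per GPU).
    The 100 MiB default threshold is an arbitrary illustration.
    """
    flagged = []
    for idx, line in enumerate(csv_text.strip().splitlines()):
        if int(line.strip()) > threshold_mib:
            flagged.append(idx)
    return flagged

if __name__ == "__main__":
    # Query per-GPU used memory in MiB, one line per GPU.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    print("GPUs with leftover memory:", leftover_gpus(out))
```

If this reports nonzero usage before your code has run, that is consistent with the release issue described above.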
Please report the problem to us and include your job ID and the hostname(s) of the affected node(s).