GPU Memory Not Released Causing OOM in Subsequent Jobs
We have noticed that GPU memory is not always released when a job finishes, which can cause subsequent jobs scheduled on the same nodes to fail with out-of-memory (OOM) errors. We are currently working on a resolution.
If you encounter an OOM error in a GPU job, you can use the job-dashboard-link.py script to generate a Grafana dashboard link for your job’s resource usage. For example:
job-dashboard-link.py -M cardinal 2502244
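If you do not have your job ID handy, the scheduler can list your recent jobs. Assuming the cluster runs Slurm (as the -M cluster flag suggests), a query such as the following should work:

sacct -M cardinal -u $USER -X --format=JobID,JobName,State,End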
Open the generated URL and navigate to GPU Metrics, then check the GPU Memory Usage panel to see how much GPU memory was in use over the course of your job.
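If you want to confirm the state of the GPUs directly from inside a job, and assuming the nodes provide NVIDIA's nvidia-smi tool, a quick check near the top of your job script might look like:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

A large memory.used value reported before your application starts would suggest that memory from a previous job has not been released.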