Nsight GPU profiler not working due to DCGM conflict

UPDATE (Mar 15, 2023)

After the downtime on Mar. 14, 2023, OSC enabled a new Slurm option, --gres=nsight. DCGM is disabled on the nodes of any job that requests this option, so Nsight functions normally in those jobs.
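As a rough illustration of how the option might be used, the sketch below writes a small batch script that requests one GPU together with --gres=nsight. Only --gres=nsight comes from this notice. The job name, time limit, GPU request, report name (my_report), and executable (./my_app) are hypothetical placeholders, and the sketch assumes the Nsight Systems command-line tool (nsys) is already on the PATH inside the job.

```shell
#!/bin/bash
# Write a minimal job script to nsight_job.sh (file name is a placeholder).
cat > nsight_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=nsight-profile
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --gres=nsight
#SBATCH --time=00:30:00

# Assumption: nsys is available in the job environment, and ./my_app is
# a hypothetical GPU executable. The report is written to my_report.*
nsys profile -o my_report ./my_app
EOF

# Submit from a login node (left commented out so the sketch has no side effects):
# sbatch nsight_job.sh
```

Jobs submitted without --gres=nsight keep DCGM enabled, so the option is probably best reserved for profiling runs.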


We are experiencing an issue with the Nsight GPU profiler: it conflicts with the GPU monitoring service (DCGM) that we are running.

This causes Nsight to malfunction and produce error messages.

Performance Regression of GPU Nodes on Ruby

We are currently seeing a performance regression on Ruby's GPU nodes. Some of these nodes remain in a power-saving state even after an application starts using them, which reduces performance in some cases. We have placed a reservation on the GPU nodes so that we can perform a rolling reboot and return them to a known-good state.

We have opened a bug report with the vendor about this performance regression and how to monitor for it.


Nvidia drivers on Oakley

We upgraded the drivers for the Nvidia GPUs on all of our clusters during the downtime this week. Unfortunately, we are noticing some subtle problems with the GPUs on Oakley. We will be rolling back to an older driver on that cluster; the GPUs will be unavailable until that work is completed, potentially all weekend.

Cannot change GPU compute mode on Oakley

Update: The driver version has been updated and the issue has been fixed.


In updating the driver version for Oakley's NVIDIA GPUs, the NVML libraries used in conjunction with Torque, our resource manager, were also updated. Unfortunately, this update has broken the ability to change a GPU's compute mode from its default, exclusive thread mode. Requests to change the GPU mode fail whether they are made through a PBS directive or the nvidia-smi command.