Performance Regression of GPU Nodes on Ruby

We are currently seeing a performance regression on Ruby's GPU nodes. Some of these nodes remain in a power-saving state even after an application starts using them, which reduces performance in some cases. We currently have a reservation on the GPU nodes so that we can perform a rolling reboot and return them to a known-good state.

We have opened a bug report with the vendor about this performance regression and how to monitor for it.
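
As an illustration only, and not the vendor's or our official monitoring method, a user could look for the symptom described above by comparing each GPU's performance state with its utilization. The minimal Python sketch below parses the output of nvidia-smi's query mode. The thresholds (what counts as "busy" and which P-states count as full performance) and the function names are assumptions chosen for the example.

```python
import subprocess

# Assumptions for illustration: a GPU above 50% utilization is "busy",
# and P0 through P2 count as full-performance states.
BUSY_UTILIZATION_PERCENT = 50
MAX_ACTIVE_PSTATE = 2

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,pstate,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def find_stuck_gpus(csv_text):
    """Return indices of GPUs that are busy but still in a low-power P-state."""
    stuck = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip lines that do not match the expected layout
        index, pstate, util = fields
        # Expect values such as "0", "P8", "97"; skip anything else (e.g. "N/A").
        if not (index.isdigit() and util.isdigit()):
            continue
        if not (pstate.startswith("P") and pstate[1:].isdigit()):
            continue
        busy = int(util) >= BUSY_UTILIZATION_PERCENT
        low_power = int(pstate[1:]) > MAX_ACTIVE_PSTATE
        if busy and low_power:
            stuck.append(int(index))
    return stuck


def query_gpus():
    """Run nvidia-smi on the local node and return its CSV output."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical sample output: GPU 1 is busy but still reports P8.
SAMPLE_OUTPUT = "0, P0, 98\n1, P8, 97\n2, P8, 0\n"
print(find_stuck_gpus(SAMPLE_OUTPUT))
```

On a GPU node, calling `find_stuck_gpus(query_gpus())` while a job is running would apply the same check to live data. A single sample can be misleading, because a GPU may briefly report a low-power state before its clocks ramp up.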


NVIDIA drivers on Oakley

We upgraded the drivers for the NVIDIA GPUs on all of our clusters during the downtime this week. Unfortunately, we are noticing some subtle problems with the GPUs on Oakley. We will be rolling back to an older driver on that cluster; the GPUs will be unavailable until that work is completed, which could take all weekend.

Cannot change GPU compute mode on Oakley

Update: The driver version has been updated and the issue has been fixed.


In updating the driver version for Oakley's NVIDIA GPUs, the NVML libraries that are used in conjunction with Torque, our resource manager, were also updated. Unfortunately, this update has broken the ability to change a GPU's compute mode from its default, exclusive thread mode. Requests to change the GPU mode fail whether they are made through a PBS directive or through the nvidia-smi command.
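
To confirm whether a mode change actually took effect, one option is to read the compute mode back from nvidia-smi after requesting it. The Python sketch below is a minimal example under stated assumptions: the function names are hypothetical, and the mode strings in the sample ("Default", "Exclusive_Thread") are what we would expect nvidia-smi to print and may differ between driver versions.

```python
import subprocess

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,compute_mode",
    "--format=csv,noheader",
]


def parse_compute_modes(csv_text):
    """Map GPU index -> compute mode string from nvidia-smi CSV output."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, _, mode = line.partition(",")
        index, mode = index.strip(), mode.strip()
        if index.isdigit() and mode:
            modes[int(index)] = mode
    return modes


def gpus_not_in_mode(csv_text, wanted_mode):
    """Return sorted indices of GPUs whose compute mode differs from wanted_mode."""
    wanted = wanted_mode.lower()
    modes = parse_compute_modes(csv_text)
    return sorted(i for i, mode in modes.items() if mode.lower() != wanted)


def query_compute_modes():
    """Run nvidia-smi on the local node and return its CSV output."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical sample: GPU 0 is still in exclusive thread mode after a
# request for the default (shared) mode; GPU 1 changed as requested.
SAMPLE_OUTPUT = "0, Exclusive_Thread\n1, Default\n"
print(gpus_not_in_mode(SAMPLE_OUTPUT, "Default"))
```

Inside a job, `gpus_not_in_mode(query_compute_modes(), "Default")` would list any GPUs that did not switch to the requested mode. An empty list means every GPU reports that mode.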
