Ruby

Rolling reboot of all clusters, starting from Wednesday morning, April 19, 2017

1:40PM 4/27/2017 Update: Rolling reboots are completed. 

3:10PM 4/18/2017 Update: Rolling reboots on Owens have started, to address GPFS errors that occurred late Friday.

A rolling reboot of the Owens, Oakley, and Ruby clusters is scheduled to start on Wednesday morning, April 19, 2017. Highlights of the rolling reboot activities:

Performance Regression of GPU Nodes on Ruby

We are currently seeing a performance regression on Ruby's GPU nodes. Some of the GPU nodes remain in a power-saving state even after an application starts using them, which reduces performance in some cases. We have placed a reservation on the GPU nodes so that we can perform a rolling reboot and return them to a known-good state.

We have opened a bug report with the vendor about this performance regression and how to monitor for it.

 
