Oakley

Rolling reboot of all clusters, starting from Wednesday morning, April 19, 2017

1:40PM 4/27/2017 Update: Rolling reboots are completed. 

3:10PM 4/18/2017 Update: Rolling reboots on Owens have started to address GPFS errors that occurred late Friday.

A rolling reboot of the Owens, Oakley, and Ruby clusters is scheduled to start on Wednesday morning, April 19, 2017. Highlights of the rolling reboot activities:

Problems with MVAPICH2

Some MVAPICH2 MPI installations on Oakley, Ruby, and Owens, such as the default module mvapich2/2.2 as well as mvapich2/2.1, appear to have a bug that is triggered by certain programs.  The symptoms are 1) the program hangs or 2) the program fails with an error related to Allreduce or Bcast.

Lustre bug causing Oakley login node crashes

Over the past two weeks we have experienced Oakley login node crashes potentially caused by a Lustre bug. The bug (or other underlying issue) appears to be triggered when a user performs operations on a Lustre directory that contains an excessive number of files (10,000 or more).

We have reached out to our support contacts and are working with them to resolve this issue. Updates will be posted here.
