Owens

Issue with GPFS on Owens since April 14, 2017

3:10PM 4/18/2017 Update: Rolling reboots on Owens have started to address this GPFS issue. 

We have had issues with GPFS mounts on the Owens cluster since Friday afternoon, April 14, 2017. The affected nodes have been marked offline so they can be restarted or rebooted to fix this issue. Jobs may have been negatively impacted since April 14. If you experience any 'stale file handle' or 'file not found' errors, please let us know.

Rolling reboot of all clusters, starting from Wednesday morning, April 19, 2017

1:40PM 4/27/2017 Update: Rolling reboots are completed. 

3:10PM 4/18/2017 Update: Rolling reboots on Owens have started to address GPFS errors that occurred late Friday.

A rolling reboot of the Owens, Oakley, and Ruby clusters is scheduled to start on Wednesday morning, April 19, 2017. Highlights of the rolling reboot activities:

Owens is in Partial Service

3:45PM April 3, 2017 Update: GPU nodes on Owens are available. 

206 Owens nodes are not accessible to users due to GPU testing and a bad Ethernet switch. The 48 nodes affected by the switch problem are expected to be available by Friday, March 31, and the remaining nodes, held for GPU testing, are expected to be available on Monday, April 3, 2017.

We apologize for the inconvenience this may cause you. Please contact oschelp@osc.edu if you have any questions. 

Problems with MVAPICH2

Some MVAPICH2 MPI installations on Oakley, Ruby, and Owens, including the default module mvapich2/2.2 as well as mvapich2/2.1, appear to have a bug that is triggered by certain programs. The symptoms are that the program either 1) hangs or 2) fails with an error related to Allreduce or Bcast.
