Users may have experienced job failures on the Owens cluster since April 16, 2018.
We will perform a rolling reboot of the login and compute nodes of the Owens cluster starting Monday, April 16, 2018.
We will perform rolling reboots of the Oakley, Ruby and Owens clusters starting Monday, Feb 5, 2018.
Updated at 9:07PM on Dec 20, 2017:
Owens batch was restored at 6:37PM on Dec 19, 2017 by updating the Torque resource manager.
Original Post at 4:45PM on Dec 19, 2017:
Owens batch has been down since approximately 4PM on Dec 19, 2017, returning the following message:
Rolling reboot of Owens cluster, starting from 8:30AM Oct 30, 2017
We will perform a rolling reboot of Owens starting at 9AM on Monday, September 11, 2017.
All PBS commands on Owens are working now
The rolling reboot of the login and compute nodes of the Owens cluster is complete.
There is a bug with VASP 5.4.1 built with mvapich2/2.2 on Owens such that a VASP job that runs out of memory crashes the Owens compute node(s). We will investigate monitoring for this type of job so that we can clean up after the job more efficiently and notify the user of the problem more quickly.
3:10PM 4/18/2017 Update: Rolling reboots on Owens have started to address this GPFS issue.
We have had issues with GPFS mounts on the Owens cluster since Friday afternoon, April 14, 2017. The affected nodes have been marked offline and will be restarted or rebooted to fix this issue. Jobs may have been negatively impacted by this issue since April 14. If you experience any 'stale file handle' or 'file not found' errors, please let us know.