Systemic Problem on Cluster Computing Service

Category: 
Resolution: Unresolved

4:30PM 6/21/2017 Update: We are again experiencing a systemic problem with the HPC systems, including but not limited to:

  • /fs/project is not accessible
  • Failure of GPU job submission on both Oakley and Owens

We have paused scheduling on Oakley, Ruby, and Owens for further investigation.

+++++++++++++++++++

4:55PM 6/20/2017 Update: All HPC systems are currently stable. We have narrowed the issue down to a line card in our virtual chassis fabric and will schedule a meeting with the network vendor to diagnose the problem further.

+++++++++++++++++++

Original Post:

We are experiencing a systemic problem with the HPC systems. Some login nodes required a reboot last night; this is likely related to a larger underlying problem, which we believe may be a networking issue inside the data center. We are actively investigating and will update the community as more is known.