Systemic Problem on Cluster Computing service

Resolution: Resolved

4:20PM 6/23/2017 Update: All HPC systems are back in production. This outage may have caused failures of users' jobs. We will update the community as more is known. 

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

3:40PM 6/23/2017 Update: All HPC systems are back in production except for the scratch service (/fs/scratch). This outage may have caused failures of users' jobs. We will update the community as more is known. 
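For users who want to check from a login node whether the scratch filesystem has come back, the Python sketch below probes /fs/scratch with a timeout so that a hung mount does not block the shell. This is an illustrative snippet only, not an OSC-provided tool; the path comes from this notice, and the 10-second timeout is an arbitrary assumption.

    import os
    import multiprocessing

    SCRATCH = "/fs/scratch"   # path reported in this notice

    def _probe(path):
        # Runs in a child process so a hung network mount can be abandoned.
        os.listdir(path)

    def scratch_available(timeout=10):
        # timeout in seconds; an arbitrary choice for this sketch
        proc = multiprocessing.Process(target=_probe, args=(SCRATCH,))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            proc.terminate()
            return False           # probe hung: filesystem likely still unavailable
        return proc.exitcode == 0  # nonzero exit means the listing itself failed

    if __name__ == "__main__":
        print("scratch reachable" if scratch_available() else "scratch NOT reachable")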

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2:55PM 6/23/2017 Update: All HPC systems are NOT accessible due to a network outage. We will reboot the network switch to help resolve this issue and will update the community as more is known. 

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

4:30PM 6/21/2017 Update: We are again experiencing a systemic problem with the HPC systems, including but not limited to:  

  • /fs/project is not accessible
  • Failure of GPU job submission on both Oakley and Owens

We have paused scheduling on Oakley, Ruby, and Owens for further investigation. 
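If you had jobs queued or running during this period, you can confirm what is still in the queue once scheduling resumes. The sketch below simply wraps a scheduler query; it assumes a Torque/Moab-style environment where qstat is on the PATH, since this notice does not state which batch system is in use, so substitute your site's equivalent command (for example, squeue -u under Slurm).

    import getpass
    import subprocess

    def list_my_jobs():
        # Assumption: a Torque/Moab-style `qstat` is available; adjust the
        # command for your scheduler if not.
        user = getpass.getuser()
        try:
            result = subprocess.run(
                ["qstat", "-u", user],
                capture_output=True, text=True, timeout=30, check=True,
            )
        except (OSError, subprocess.SubprocessError) as exc:
            print("Could not query the scheduler:", exc)
            return
        print(result.stdout or "No jobs found for user %s" % user)

    if __name__ == "__main__":
        list_my_jobs()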

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

4:55PM 6/20/2017 Update: All HPC systems are operationally stable at the moment. We have narrowed the issue down to a line card in our virtual chassis fabric and will schedule a meeting with the network vendor to further diagnose the problem. 

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Original Post:

We are experiencing a systemic problem with the HPC systems. Some login nodes required a reboot last night, which is likely related to a larger underlying problem; we believe it may be a networking issue inside the data center. We are actively investigating and will update the community as more is known.