4:20PM 6/23/2017 Update: All HPC systems are back in production. The outage may have caused user jobs to fail. We'll update the community as more is known.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3:40PM 6/23/2017 Update: All HPC systems are back in production except for the scratch service (/fs/scratch). The outage may have caused user jobs to fail. We'll update the community as more is known.
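Users who want to confirm that scratch is reachable before resubmitting work could use a minimal check like the Python sketch below. The mount point /fs/scratch comes from the update above; probing it with a directory listing is our own assumption, not an official diagnostic.

import os
import sys

# Minimal sketch: probe the scratch mount before resubmitting jobs.
# /fs/scratch is the mount point named in the update above; listing it
# raises OSError if the filesystem is down or unmounted.
SCRATCH = "/fs/scratch"

try:
    os.listdir(SCRATCH)
except OSError as exc:
    sys.exit("{} is not accessible: {}".format(SCRATCH, exc))

print("{} is accessible".format(SCRATCH))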
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2:55PM 6/23/2017 Update: All HPC systems are NOT accessible due to a network outage. We'll reboot the network switch to help resolve this issue and will update the community as more is known.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4:30PM 6/21/2017 Update: We are again experiencing a systemic problem with the HPC systems, including but not limited to:
- /fs/project is not accessible
- Failure of GPU job submission on both Oakley and Owens
We have paused scheduling on Oakley, Ruby, and Owens while we investigate further.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4:55PM 6/20/2017 Update: All HPC systems are operationally stable at the moment. We have narrowed the issue down to a line card in our virtual chassis fabric and will schedule a meeting with the network vendor to diagnose the problem further.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Original Post:
We are experiencing a systemic problem with the HPC systems. Some login nodes required a reboot last night, which is likely related to a larger underlying problem; we believe it may be a networking issue inside the data center. We are actively investigating the issue and will update the community as more is known.