4:30PM 6/21/2017 Update: We are again experiencing a systemic problem with the HPC systems, including but not limited to:
- /fs/project is not accessible
- Failure of GPU job submission on both Oakley and Owens
We have paused scheduling on Oakley, Ruby, and Owens while we investigate further.
4:55PM 6/20/2017 Update: All HPC systems are operationally stable at the moment. We have narrowed the issues down to a line card in our virtual chassis fabric, and will schedule a meeting with our network vendor to further diagnose the problem.
We are experiencing a systemic problem with the HPC systems. Some login nodes required a reboot last night. This is likely related to a larger underlying problem, which we believe may be a networking issue inside the data center. We are actively investigating the issue and will update the community as more is known.