4:20PM 6/23/2017 Update: All HPC systems are back in production. The outage may have caused user jobs to fail. We'll update the community as more is known.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3:40PM 6/23/2017 Update: All HPC systems are back in production except for the scratch service (/fs/scratch). The outage may have caused user jobs to fail. We'll update the community as more is known.
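Users who want to confirm that scratch is reachable before resubmitting work could use a minimal check like the Python sketch below. The mount point /fs/scratch comes from the update above; probing it with a directory listing is our own assumption, not an official diagnostic.

import os
import sys

# Minimal sketch: probe the scratch mount before resubmitting jobs.
# /fs/scratch is the mount point named in the update above; listing it
# raises OSError if the filesystem is down or unmounted.
SCRATCH = "/fs/scratch"

try:
    os.listdir(SCRATCH)
except OSError as exc:
    sys.exit("{} is not accessible: {}".format(SCRATCH, exc))

print("{} is accessible".format(SCRATCH))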
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2:55PM 6/23/2017 Update: All HPC systems are NOT accessible due to a network outage. We'll reboot the network switch to help resolve this issue and will update the community as more is known.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4:30PM 6/21/2017 Update: We are again experiencing a systemic problem with the HPC systems, including but not limited to:
- /fs/project is not accessible
- Failure of GPU job submission on both Oakley and Owens
We have paused scheduling on Oakley, Ruby, and Owens while we investigate further.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4:55PM 6/20/2017 Update: All HPC systems are operationally stable at the moment. We have narrowed the issue down to a line card in our virtual chassis fabric and will schedule a meeting with the network vendor to diagnose the problem further.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Original Post:
We are experiencing a systemic problem with the HPC systems. Some login nodes required a reboot last night, which is likely related to a larger underlying problem; we believe it may be a networking issue inside the data center. We are actively investigating the issue and will update the community as more is known.