Emergency InfiniBand Shutdown (All systems)

Resolution: Resolved

We have returned to service. It appears that we have resolved the networking issues enough to allow jobs to run safely. We will continue working with our vendors to fix any remaining hardware issues, and may need to schedule a short downtime in the near future to make a few more permanent changes to the infrastructure.

We are experiencing technical difficulties with Lustre and InfiniBand, and have been seeing a high failure rate on parallel jobs. As a result, early on 7/30 we suspended the job scheduler to investigate the issue further.

We have stopped both GPFS and Lustre and rebooted all network switches in an attempt to address the network routing issues. Unfortunately, this did not resolve all issues. Engineers from HP are involved, and we are working to fix some hardware that does not appear to be operating properly.

We will not be able to return to service until 7/31. We hope to have systems in an operational state by about 5 PM.

If you suspect your job failed due to this issue, you may contact OSC Help, and we will refund RUs or give queue priority on a case-by-case basis. More information will be provided via the @HPCNotices Twitter account and via notices on the Supercomputing page.

Thank you for your patience as we work to rectify these issues.

