We have returned to service. It appears that we have resolved the networking issues enough to allow jobs to run safely. We will continue working with our vendors to fix any remaining hardware issues, and may need to schedule a short downtime in the near future to make a few more permanent changes to the infrastructure.
We are experiencing technical difficulties with Lustre and InfiniBand, and have been seeing a high failure rate on parallel jobs. As a result, early on 7/30 we suspended the job scheduler to investigate the issue further.
We have stopped both GPFS and Lustre, and rebooted all network switches in an attempt to address the network routing issues. Unfortunately, this did not resolve all of the issues. We have engineers from HP involved, and are working on fixing some hardware that does not appear to be operating properly.
We will not be able to return to service until 7/31. We hope to have systems in an operational state by about 5 PM.
If you suspect your job has failed due to this issue, you may contact OSC Help and we will refund RUs or give queue priority on a case-by-case basis. More information will be provided via the @HPCNotices Twitter account, and via notices on the Supercomputing page.
Thank you for your patience as we work to rectify these issues.