We have returned to service. The networking issues appear to be resolved well enough for jobs to run safely. We will continue working with our vendors to fix any remaining hardware issues, and we may need to schedule a short downtime in the near future to make a few more permanent changes to the infrastructure.
We are experiencing technical difficulties with Lustre and InfiniBand, and have been seeing a high failure rate on parallel jobs. As a result, we suspended the job scheduler early on 7/30 to investigate the issue further.