Unscheduled GPFS Outage

Resolution: Resolved

As of 11:30PM on June 16th, we have removed the GPFS filesystem from service due to a number of hardware failures. At this point, further hardware failures would put a large portion of the array in jeopardy, potentially requiring us to restore hundreds of terabytes from tape backups. We have decided to take the system offline now to address the problem, since doing so may prevent a much longer outage.

We are working with vendor support to identify and fix the underlying problem as soon as possible. We will post updates here as we have more information to share.

UPDATE 6/18 11AM: We have received replacement hardware from the vendor for the failed controller, and we have been rebuilding the arrays with failed disks. We cannot yet share an estimated time for return to production status, but we are investigating alternatives that would allow a quicker return to service in a degraded state while we complete repairs for optimum system health in the background.

UPDATE 6/19 7PM: GPFS has been returned to service.