Lustre jobs suspended

Category: 
Resolution: Resolved

The Lustre filesystem ($PFSDIR and /fs/lustre) crashed several times on Friday evening (8/15). We have temporarily degraded this service while we work to isolate the actions that trigger the crashes. DDN engineers have confirmed that we are experiencing a known bug and are working with Intel to get us updated binaries. We are also looking for strategies that would let us return to full service before the updated binaries are installed.

Meanwhile, jobs that are submitted from /fs/lustre or that use $PFSDIR are being rejected at queue time. If you reach /fs/lustre through a symlink, the queuing system may not reject your job, but the job may still fail because of filesystem crashes.
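
If you want to check your own jobs before submitting, the following is a minimal sketch of one way to do it. It is not the check the queuing system performs. The function names are hypothetical, and it assumes that /fs/lustre and PFSDIR (the two names given in this notice) are the only ways a job reaches Lustre. The second function resolves symlinks first, which covers the symlink case described above.

```python
import os

# Assumption: these are the only Lustre locations, as named in this notice.
LUSTRE_ROOT = "/fs/lustre"
LUSTRE_MARKERS = ("/fs/lustre", "PFSDIR")


def script_mentions_lustre(script_text):
    """Return True if the job script text names /fs/lustre or PFSDIR."""
    return any(marker in script_text for marker in LUSTRE_MARKERS)


def path_is_on_lustre(path, lustre_root=LUSTRE_ROOT):
    """Return True if path, after resolving symlinks, lies under lustre_root."""
    real = os.path.realpath(path)
    root = os.path.realpath(lustre_root)
    # commonpath equals root only when real is root itself or lies beneath it.
    return os.path.commonpath([real, root]) == root
```

For example, calling path_is_on_lustre(os.getcwd()) from the directory you plan to submit from reports whether that directory is really on Lustre, even if you reached it through a symlink in your home directory.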

So far, the system has appeared stable since we stopped scheduling Lustre jobs.

Depending on how you use Lustre, you may be able to move your data to GPFS and continue your work during this period. If you need assistance or are unsure how to proceed, please contact OSCHelp by email (oschelp@osc.edu).

We apologize for the disruption and appreciate your patience. 

NOTE (08/25/2014): The system will begin accepting jobs that use Lustre tomorrow morning.