NFS service disruption 6/29/16

OSC experienced errors with NFS services on the morning of June 29, between 08:37 and 09:12, that may have caused some jobs to fail or behave unexpectedly. The errors would have caused legacy paths to user home directories through /nfs/[01-18] to fail. /nfs/gpfs also failed to mount on several nodes.

Problems with Project Space (/nfs/gpfs)

(9/8/15 14:21 Eastern) Project space appears to be back to normal operation. We are running some tests to verify that the problem is fully resolved.

As of early afternoon, Sept. 8, 2015, we are experiencing some problems with access to /nfs/gpfs. We are aware of the problem, and are working to resolve the issue and return the project space service to normal operation as soon as possible.

Lustre bug causing Oakley login node crashes

Over the past two weeks we have experienced Oakley login node crashes potentially caused by a Lustre bug. The bug (or other issue) appears to be triggered when a user performs operations on a Lustre directory that contains an excessive number of files (10000+ files).

We have notified our support contacts and are working with them to resolve this issue. Updates will be posted here.

Unscheduled GPFS Outage

As of 11:30PM on June 16th, we have removed the GPFS filesystem from service due to a number of hardware failures. We are working with our vendor support to identify and fix the underlying problem as soon as possible. We will post updates here as more information becomes available.

Lustre Updates

9/10/14 - We have not seen any additional crashes of the Lustre servers since making this change.

- Lustre jobs are being accepted as of 10AM this morning.  Please report any problems to

8/25/14 - The system will allow jobs using Lustre starting tomorrow morning.

The Lustre filesystem ($PFSDIR and /fs/lustre) will be offline starting at 3:00pm on 8/25/2014.

Lustre jobs suspended

The Lustre filesystem ($PFSDIR and /fs/lustre) crashed several times on Friday evening (8/15). We have temporarily degraded this service while we work to isolate the actions that are triggering the crashes. DDN engineers have confirmed that we are experiencing a known bug and are working with Intel to get us updated binaries. We are also looking for strategies that would let us return to full service before the updated binaries are installed.

