filesystem

Problems with Project Space (/nfs/gpfs)

(9/8/15 14:21 Eastern) Project space appears to be back to normal operation. We are running some tests to verify that the problem is fully resolved.


As of early afternoon, Sept. 8, 2015, we are experiencing some problems with access to /nfs/gpfs. We are aware of the problem, and are working to resolve the issue and return the project space service to normal operation as soon as possible.

Lustre bug causing Oakley login node crashes

Over the past two weeks we have experienced Oakley login node crashes potentially caused by a Lustre bug. The bug (or other issue) seems to be triggered when a user performs operations on a Lustre directory that contains an excessive number of files (10000+ files).

We have contacted our vendor support and are working with them to resolve this issue. Updates will be posted here.
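As an illustration only, one way a user might keep any single directory well below the file count mentioned above is to spread output files across numbered subdirectories. The sketch below is not an OSC recommendation or a fix for the bug; the function name `sharded_path`, the `part00000`-style subdirectory names, and the limit of 1000 files per directory are all assumptions chosen for the example.

```python
import os

# Assumed per-directory limit, chosen to stay far below the 10000+ figure
# in the notice. This value is hypothetical, not an OSC-published limit.
MAX_FILES_PER_DIR = 1000


def sharded_path(base_dir, index, name, max_per_dir=MAX_FILES_PER_DIR):
    """Return a path for file number `index` under a numbered subdirectory.

    Files 0..max_per_dir-1 go to part00000, the next max_per_dir files
    go to part00001, and so on, so no subdirectory exceeds max_per_dir files.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    bucket = index // max_per_dir
    return os.path.join(base_dir, f"part{bucket:05d}", name)
```

A job writing many files could call `os.makedirs(os.path.dirname(path), exist_ok=True)` on the returned path before opening it, so that each subdirectory is created on first use.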

Unscheduled GPFS Outage

As of 11:30PM on June 16th, we have removed the GPFS filesystem from service due to a number of hardware failures. We are working with our vendor support to identify and fix the underlying problem as soon as possible. We will post updates here as more information becomes available.

Lustre Updates

9/10/14 - We have not seen any additional crashes of the Lustre servers since making this change.

8/26/14 - Lustre jobs are being accepted as of 10AM today. Please report any problems to OSCHelp@osc.edu

8/25/14 - The system will allow jobs using Lustre starting tomorrow morning.

The Lustre filesystem ($PFSDIR and /fs/lustre) will be offline starting at 3:00pm on 8/25/2014.

Lustre jobs suspended

The Lustre filesystem ($PFSDIR and /fs/lustre) crashed several times Friday evening (8/15). We have temporarily degraded this service while we work to isolate the actions that are triggering the crashes. DDN engineers have confirmed that we are experiencing a known bug, and are working with Intel to get us updated binaries. We are also looking for ways to return to full service before the updated binaries are installed.

Lustre, Infiniband Operational and Being Monitored Closely

UPDATE: Most users should no longer see any issues with Lustre.


Again, please continue to notify OSC Help of any errors you see in job output. For example, you might see "IBV_EVENT_PORT_ERR" in your job output. Notifying the helpdesk quickly will help the Operations staff to reduce the effects of any issues.

We apologize for the disruption. We work hard to avoid these incidents, but sometimes they do happen. We appreciate your patience.
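To help spot the error string mentioned above before reporting it, a short script can scan job output for "IBV_EVENT_PORT_ERR". This is a minimal sketch, not an OSC-provided tool: the function name `find_error_lines` is hypothetical, and the output file names are whatever your own jobs produce.

```python
import sys


def find_error_lines(lines, marker="IBV_EVENT_PORT_ERR"):
    """Return (line_number, text) pairs for every line containing `marker`."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if marker in line
    ]


if __name__ == "__main__":
    # Each command-line argument is treated as the path to a job output file.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, text in find_error_lines(handle):
                print(f"{path}:{number}: {text}")
```

Running the script with one or more job output files as arguments prints each matching line with its file name and line number, which can be included in a message to OSC Help.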

Brief disruption of GPFS on 8/28/2013

On the morning of August 28th, 2013, we will briefly disrupt the GPFS filesystem to reboot servers. This is necessary to upgrade the GPFS system. The in-place upgrade should interrupt service to the GPFS filesystem for only a few minutes. Running jobs should pause until the system returns to service.

Please contact OSC Help if you experience any problems.

Poor network performance on some filesystems

We are experiencing some network performance issues on a cluster of servers involved with providing GPFS and some project filesystems. GPFS appears to be functioning acceptably, but proj01, proj02, proj03, proj08, and proj09 are not. Compute nodes attempting to write to these filesystems will see very slow write speeds.

The root cause has been identified as a damaged fiber optic cable. We will be replacing this cable, and expect an outage of less than one minute to the affected hosts.
