As of 11:30PM on June 16th, we have removed the GPFS filesystem from service due to a number of hardware failures. We are working with vendor support to identify and fix the underlying problem as soon as possible. We will post updates here as more information becomes available.
9/10/14 - We have not seen any additional crashes of the Lustre servers since making this change.
8/26/14 - Lustre jobs are being accepted as of 10AM this morning. Please report any problems to OSCHelp@osc.edu.
8/25/14 - The system will allow jobs using Lustre starting tomorrow morning.
The Lustre filesystem ($PFSDIR and /fs/lustre) will be offline starting at 3:00pm on 8/25/2014.
The Lustre filesystem ($PFSDIR and /fs/lustre) crashed several times on Friday evening (8/15). We have temporarily degraded this service while we work to isolate the actions that are triggering the crashes. DDN engineers have confirmed that we are experiencing a known bug, and are working with Intel to get us updated binaries. We are also looking for ways to return to full service before the updated binaries are installed.
UPDATE: Most users should no longer see any issues with Lustre.
Again, please continue to notify OSC Help of any errors you see in job output. For example, you might see "IBV_EVENT_PORT_ERR" in your job output. Notifying the helpdesk quickly will help the Operations staff to reduce the effects of any issues.
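For users who want to check their own job output before contacting the helpdesk, the following is a minimal sketch of one way to search output files for the error string quoted above. The function names, the directory argument, and the "*.o*" file pattern are illustrative assumptions, not an OSC-provided tool; adjust the pattern to match how your job output files are actually named.

```python
from pathlib import Path

# Error string quoted in the notice above.
ERROR_MARKER = "IBV_EVENT_PORT_ERR"


def find_error_lines(text, marker=ERROR_MARKER):
    """Return (line_number, line) pairs for every line containing the marker."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]


def scan_job_outputs(directory, pattern="*.o*"):
    """Scan files in `directory` matching `pattern` for the marker.

    The default pattern is an assumption about output file naming;
    change it to suit your own jobs.
    """
    hits = {}
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file has stray bytes.
        matches = find_error_lines(path.read_text(errors="replace"))
        if matches:
            hits[str(path)] = matches
    return hits


if __name__ == "__main__":
    # Scan the current directory and print any matching lines.
    for filename, matches in scan_job_outputs(".").items():
        for number, line in matches:
            print(f"{filename}:{number}: {line}")
```

Including the file name and the matching lines in your report to OSC Help should make it easier for the Operations staff to locate the affected job.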
We apologize for the disruption. We work hard to avoid these incidents, but sometimes they do happen. We appreciate your patience.
We are currently experiencing difficulties with the servers for the filesystem mounted at /nfs/proj13.
On the morning of August 28th, 2013, we will briefly disrupt the GPFS filesystem to reboot servers. This is necessary to upgrade the GPFS system. The in-place upgrade should interrupt service to the GPFS filesystem for only a few minutes. Running jobs should pause until the system returns to service.
Please contact OSC Help if you experience any problems.
We are experiencing some network performance issues on a cluster of servers involved with providing GPFS and some project filesystems. GPFS appears to be functioning acceptably, but proj01, proj02, proj03, proj08, and proj09 are not. Compute nodes attempting to write to these filesystems will see very slow write speeds.
The root cause has been identified as a damaged fiber optic cable. We will be replacing this cable, and expect an outage of less than one minute to the affected hosts.
Today, May 14, 2013, at 12:45PM, we will temporarily remove one of the home directory servers from service to address some reliability issues. Users with home directories under /nfs/13 will experience a brief disruption in service. Logins will likely fail, and jobs will not run for these users.
We expect this outage to last for 15-30 minutes.