We had a GPFS hang that caused unexpected job failures between 19:37 and 20:00 on 09/08/2016.
OSC experienced errors with NFS services the morning of June 29 between 08:37 and 09:12 that may have caused some jobs to fail or other unexpected behavior. The errors resulted in the failure of the legacy paths to user home directories through /nfs/[01-18]. There were also mount failures of /nfs/gpfs on several nodes.
Update: Downtime completed at 6:30PM, June 7th.
The June 7th downtime is now slated to be completed at 6:30PM; the previous estimate was 5PM.
All systems and services will continue to be unavailable until that time.
Thank you for your cooperation.
We are currently investigating multiple reports that Globus Online transfers between OSC and other sites are failing. Transfers to/from Globus Personal Endpoints do not seem to be affected.
Please let us know if you experience issues using Globus.
(9/8/15 14:21 Eastern) Project space appears to be back to normal operation. We are running some tests to verify that the problem is fully resolved.
As of early afternoon, Sept. 8, 2015, we are experiencing problems with access to /nfs/gpfs. We are aware of the problem and are working to return the project space service to normal operation as soon as possible.
Over the past two weeks we have experienced Oakley login node crashes potentially caused by a Lustre bug. The bug (or other underlying issue) seems to be triggered when a user performs operations on a Lustre directory that contains an excessive number of files (10,000+).
We have contacted our vendor support and are working with them to resolve this issue. Updates will be posted here.
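In the meantime, one way to avoid triggering the issue is to keep Lustre directories well under that size. Below is a minimal Python sketch for checking whether a directory has grown into the reported problem range; the count_entries helper is hypothetical (not an OSC-provided tool), and the 10,000-entry threshold comes from the observation above.

    # Hypothetical helper: stream over a directory's entries so the
    # full file list never has to be built in memory, and report
    # whether the directory is in the 10,000+ range noted above.
    import os
    import sys

    def count_entries(path):
        """Return the number of entries in `path` via a streaming scan."""
        count = 0
        with os.scandir(path) as entries:
            for _ in entries:
                count += 1
        return count

    if __name__ == "__main__":
        directory = sys.argv[1] if len(sys.argv) > 1 else "."
        n = count_entries(directory)
        print("{}: {} entries".format(directory, n))
        if n >= 10000:
            print("Warning: this directory is in the range reported to trigger the bug.")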
As of 11:30PM on June 16th, we have removed the GPFS filesystem from service due to a number of hardware failures. We are working with our vendor support to identify and fix the underlying problem as soon as possible. We will post updates here as more information becomes available.
9/10/14 - We have not seen any additional crashes of the Lustre servers since making this change.
8/26/14 - Lustre jobs are being accepted as of 10AM this morning. Please report any problems to OSCHelp@osc.edu.
8/25/14 - The system will allow jobs using Lustre starting tomorrow morning.
The Lustre filesystem ($PFSDIR and /fs/lustre) will be offline starting at 3:00pm on 8/25/2014.
The Lustre filesystem ($PFSDIR and /fs/lustre) crashed several times Friday evening (8/15). We have temporarily degraded this service while we work to isolate the actions that are triggering the crashes. DDN engineers have confirmed that we are experiencing a known bug and are working with Intel to get us updated binaries. We are also investigating strategies that might allow us to return to full service before the updated binaries are installed.
UPDATE: Most users should no longer see any issues with Lustre.
Again, please continue to notify OSC Help of any errors you see in job output; for example, you might see "IBV_EVENT_PORT_ERR" in your job output. Notifying the helpdesk quickly helps the Operations staff reduce the impact of any issues.
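If you want to screen your own output files before contacting the helpdesk, the following is a minimal Python sketch (not an OSC-provided tool); the "*.o*" file pattern is an assumed PBS-style job output naming convention, and the error list contains only the string mentioned above.

    # Minimal sketch (not an OSC tool): scan job output files for error
    # strings worth reporting to OSC Help. The "*.o*" glob and the error
    # list are illustrative assumptions, not a documented interface.
    import glob

    ERROR_STRINGS = ("IBV_EVENT_PORT_ERR",)  # from the notice above

    def find_errors(pattern="*.o*"):
        """Yield (filename, line) pairs for lines containing a known error string."""
        for path in glob.glob(pattern):
            with open(path, errors="replace") as fh:
                for line in fh:
                    if any(err in line for err in ERROR_STRINGS):
                        yield path, line.rstrip()

    if __name__ == "__main__":
        for path, line in find_errors():
            print("{}: {}".format(path, line))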
We apologize for the disruption. We work hard to avoid these incidents, but sometimes they do happen. We appreciate your patience.