OSC internal network problems 25 Sept. 2020
OSC is currently experiencing problems with its internal network. Interactive sessions may be slow or unresponsive, but running jobs should not be affected.
All user-facing issues have been resolved and the services are back.
OSC will replace the Ethernet switches in the Owens cluster starting on Dec 14.
We have returned to service. It appears that we have resolved the networking issues enough to allow jobs to run safely. We will continue working with our vendors to fix any remaining hardware issues, and may need to schedule a short downtime in the near future to make a few more permanent changes to the infrastructure.
We have been experiencing technical difficulties with Lustre and InfiniBand and were seeing a high failure rate on parallel jobs. As a result, early on 7/30 we suspended the job scheduler to investigate the issue further.
At 8AM on September 11, 2013, we will be rebooting a network switch to replace a failed card. Network service will be disrupted for 10 to 15 minutes while the work is done. Filesystem mounts may experience difficulties, and running jobs may hang for the duration of the reboot but will resume without failure. If you experience any unexpected problems, please contact OSC Help.
At 8AM on 8/1/2013, we will be replacing some faulty hardware in our network infrastructure. Unfortunately, this work cannot be delayed until the next downtime, and the replacement will cause a short disruption of network services for our compute nodes. Jobs may temporarily hang if they are attempting to access network-provided storage or communicate between nodes. It is possible that a few jobs may fail to complete properly, but only under a very specific set of circumstances.
At 8AM on Tuesday, July 9th 2013, we will be re-seating a network card in a switch at our operations center. It is possible that a brief (~10 minute) outage may occur. Jobs will pause for the duration of any outage, and resume once the network becomes available again. If a job's walltime expires during an outage, the job may be terminated. Connections to OSC systems may be terminated, and attempts to log in may generate a "no route to host" error. Please contact OSC Help if you have any concerns.