We will have rolling reboots of all three clusters starting 9:30 AM June 05, 2019.
We will perform Ethernet switch replacement work from 12pm to 3pm on Thursday, Jan 17, affecting all login nodes and 2 quick nodes on Owens. As a result, users won't be able to log into Owens at the beginning and end of the maintenance work, and won't be able to use Owens VDI through OnDemand during the entire maintenance window. Running jobs on Owens, as well as other OSC services (Pitzer, Ruby, and filesystems), won't be impacted.
Rolling reboots of Owens and Pitzer, starting Tuesday, Jan 22, 2019
All services are back online.
We will have rolling reboots of the cluster login nodes at 7:00AM Dec 19, 2017 for a GPFS version upgrade. This should be completed quickly. If you encounter any login issues, please try again after a few minutes.
We apologize for any inconvenience this may cause you. Please contact firstname.lastname@example.org if you have any questions.
We will have rolling reboots of the Oakley and Ruby clusters starting at 8:30AM on Monday, October 9, 2017.
4:56PM 3/28/2017 Update: The rolling reboots of all systems are completed.
We upgraded both the Oakley and Ruby clusters to RHEL 6.8 during the October 12 downtime. Unfortunately, we have since noticed an NFS problem that causes rsh and ssh sessions to hang on Oakley and Ruby. To resolve this issue, we have downgraded the kernel to a version that does not exhibit the NFS regression and have started rebooting compute nodes on Oakley and Ruby. This won't affect any running jobs, but users may experience longer queue wait times on Oakley and Ruby.
Update: Downtime completed at 6:30PM, June 7th.
The June 7th downtime is now slated to be completed at 6:30PM; the previous estimate was 5PM.
All systems and services will continue to be unavailable until that time.
Thank you for your cooperation.
Over the past two weeks we have experienced Oakley login node crashes, potentially caused by a Lustre bug. The bug (or other underlying issue) appears to be triggered when a user performs operations on a Lustre directory that contains an excessive number of files (10,000+).
We are working with our support contacts to resolve this issue. Updates will be posted here.