A switch reboot is planned to start at 8 AM on Monday, February 5, 2024. During the reboot, the pitzer-login02 and pitzer-login04 nodes will be unavailable, while the rest will keep working. The work is expected to be completed by 9 AM on February 5, 2024.
We have scheduled a Slurm database repair, which is planned to start at 8:30 am US/Eastern on Thursday, January 25, 2024. During the repair, the Slurm database will be offline. Running jobs and OnDemand jobs will not be affected, but any jobs using the '--clusters/-M' flag will receive errors, and commands such as sacct will be unavailable. The outage is expected to last up to 2 hours.
A downtime for OSC HPC systems is scheduled from 7 a.m. to 9 p.m., Tuesday, December 19, 2023. The downtime will affect the Pitzer, Owens and Ascend Clusters, web portals, HPC file servers, and state-wide licenses. MyOSC (the client portal) will be available during the downtime. In preparation for the downtime, the batch scheduler will not start jobs that cannot be completed before 7 a.m., December 19. Jobs that are not started on clusters will be held until after the downtime and then started once the system is returned to production status.
The Slurm upgrades performed today (Oct 25, 2023) during the rolling reboots of Ascend, Owens and Pitzer caused all running jobs on the systems to be requeued around 8:45am. You will not be billed for the resources consumed before the jobs were requeued.
We will perform rolling reboots of the Ascend, Owens and Pitzer clusters, including login and compute nodes, starting at 9AM on Wednesday, October 25, to apply NVIDIA driver and Slurm upgrades. At the start of the rolling reboots, all login nodes will be unavailable for about 10 minutes. The rolling reboots won't affect any running batch jobs, but users may experience longer queue wait times than usual on the clusters.
A downtime for OSC HPC systems is scheduled from 7 a.m. to 9 p.m., Tuesday, August 8, 2023. The downtime will affect the Pitzer, Owens and Ascend Clusters, web portals, and HPC file servers. MyOSC and state-wide licenses will be available during the downtime. See this link for more details: https://www.osc.edu/calendar/events/2023_08_08-system_downtime_august_8_2023
Beginning at 9am on Wednesday, June 14, we will perform rolling reboots of all clusters (Owens, Pitzer, and Ascend). This includes both login and compute nodes and will be done to apply OS updates. Login nodes will be rebooted at 9am on June 14, which will leave them unavailable for approximately 5-10 minutes. The rolling reboots won't affect any running batch jobs, but users may experience longer-than-usual wait times on all clusters.
The MPI libraries on Pitzer will change after May 20. We will upgrade MOFED from 4.9 to 5.6 and recompile all OpenMPI and Mvapich2 installations against the newer MOFED version. Users who built their own MPI libraries may see job failures and will need to rebuild those libraries and any applications linked against them.
Beginning Saturday, May 20, 2023, at 7 a.m., there will be an outage of most of the 48-core Pitzer nodes and a few Owens nodes for UPS repair work. We anticipate that this work will be completed by 2 p.m. on May 20.