We have been experiencing an issue with the Ethernet switches in the Owens cluster that could kill running jobs on Owens. We are monitoring the issue closely and reserving nodes that show switch errors for emergency maintenance. Based on our monitoring, no running job has been killed by this issue so far. The Oakley and Ruby clusters are not affected. An Owens outage may be required to apply a permanent fix. We'll provide updates as we learn more from the vendor. We apologize for any inconvenience this may cause you.
The Owens, Ruby, and Oakley clusters will receive OS kernel and BIOS updates to address the latest Intel processor security vulnerabilities. As part of these updates, the OS distribution on Ruby and Oakley must be updated from Red Hat version 6.9 to 6.10. These changes will be applied via rolling reboots of compute nodes. Owens and Ruby will also have a software refresh at this time. For the most up-to-date information on the software refresh, please see https://bit.ly/2PANrtZ
We will have rolling reboots of the Owens and Ruby clusters, including login and compute nodes, starting at 8 AM Monday, August 6, 2018. These rolling reboots will make the latest CUDA 9 and dependent software available with the updated device driver, in preparation for the software refresh at the end of August. The rolling reboots won't affect any running jobs, but users may experience longer queue wait times than usual on these clusters. The Oakley cluster will not be affected. For the most up-to-date information, please see https://bit.ly/2Md5uod
CMake is a family of compilation tools that can be used to build, test and package software.
Availability and Restrictions
The current versions of CMake available at OSC are:
A downtime is scheduled for all HPC systems on July 17, 2018, beginning at 7:00 A.M. and expected to finish by 5:00 P.M. The downtime will affect the Oakley, Owens, and Ruby clusters. Login services and access to storage will not be available during this time. In preparation for the downtime, the batch scheduler will begin holding jobs that cannot complete before July 17 at 7:00 A.M.
We will have rolling reboots of three clusters (Owens, Ruby, and Oakley), including login and compute nodes, starting at 8 AM Tuesday, June 19, 2018. The rolling reboots will address recent kernel vulnerabilities patched by Red Hat. The rolling reboots won't affect any running jobs, but users may experience longer queue wait times than usual on the clusters. For the most up-to-date information, please see https://bit.ly/2sRW3Ts
A reboot of the license server is scheduled for 9 a.m. Tuesday, May 29. The reboot is expected to take 10 minutes. During the reboot, we will pause scheduling, so no new jobs will start on any of the three clusters (Oakley, Ruby, and Owens). Most running jobs will not be affected, but running jobs that use special software packages may fail with license errors. For more information, see this link: https://bit.ly/2krk52Y