PyTorch jobs timing out and hanging |
GPU |
Resolved |
We have observed that many PyTorch users frequently encounter random timeouts, which result in the termination of their jobs but leave the process running on the node.... Read more |
1 year 5 months ago |
11 months 1 week ago |
qsub filter rejects valid jobs |
|
Resolved |
Job scripts submitted on Glenn, Oakley, or Ruby all go through a submit filter before reaching the resource manager, Torque. A bug has been discovered in our submit filter which prevents jobs with the... Read more |
9 years 8 months ago |
9 years 2 months ago |
quota exceeded error when using chgrp in /fs/ess directories |
filesystem |
Resolved |
Users may receive an error when using the chgrp command on data in /fs/ess/ locations.
$ chgrp -v PEX1234 my-file.txt
chgrp: changing group of 'my-file.txt': Disk quota exceeded
failed... Read more |
1 year 10 months ago |
1 year 9 months ago |
Replacement of Owens Ethernet switches from Dec 14, 2018 |
Network, Owens |
Resolved |
Updated on Jan 16, 2019, at 09:20 AM:
The replacement is done except for the three switches including the login nodes of Owens. We posted another notice for more... Read more |
6 years 3 months ago |
5 years 11 months ago |
Rolling reboot of all clusters, starting from 8 AM Tuesday, June 19, 2018 |
Batch, Owens, Ruby |
Resolved |
Posted on June 12, 2018, at 4:40 PM:
We will have rolling reboots of three clusters (Owens, Ruby, and Oakley) including login and compute nodes, starting from 8 AM Tuesday... Read more |
6 years 6 months ago |
6 years 5 months ago |
Rolling reboot of all clusters, starting from 9:30 AM June 05, 2019 |
Batch, login, Owens, Pitzer, Ruby |
Resolved |
Update #2 Posted on 14 June 2019 12:33 PM
The rolling reboots of all clusters are completed. Please contact oschelp@osc.edu if you... Read more |
5 years 6 months ago |
5 years 6 months ago |
Rolling reboot of all clusters, starting from Wednesday morning, April 19, 2017 |
Batch, Maintenance, Owens, Ruby |
Resolved |
1:40PM 4/27/2017 Update: Rolling reboots are completed.
3:10PM 4/18/2017 Update: Rolling reboots on Owens have started to address GPFS errors occurred... Read more |
7 years 8 months ago |
7 years 7 months ago |
Rolling reboot of Ascend, Owens and Pitzer starting from Oct 25 2023 |
Owens, Pitzer |
Resolved |
Update on Nov 8 2023:
Rolling reboots of all clusters are completed.
Update on Nov 3 2023:
Rolling reboots of Ascend and Pitzer clusters... Read more |
1 year 1 month ago |
1 year 1 month ago |
Rolling reboot of compute and login nodes of all clusters, starting from Wednesday morning, March 22, 2017 |
login, Owens, Ruby |
Resolved |
4:56PM 3/28/2017 Update: The rolling reboots of all systems are completed.
All compute nodes and login nodes of Owens, Oakley, and Ruby clusters will need to be rebooted... Read more |
7 years 9 months ago |
7 years 8 months ago |
Rolling reboot of login nodes of clusters at 7:00AM Dec 19, 2017 |
login |
Resolved |
We will have a rolling reboot of the login nodes of all clusters at 7:00AM Dec 19, 2017 for a GPFS version upgrade. It is expected to be completed in a short period of time. If you encounter any login issues,... Read more |
6 years 12 months ago |
6 years 11 months ago |