Known issues

Unresolved known issues

A known issue with an Unresolved Resolution state is an active problem under investigation; a temporary workaround may be available.

Resolved known issues

A known issue with a Resolved (workaround) Resolution state is an ongoing problem; a permanent workaround is available, which may include using different software or hardware.

A known issue with a Resolved Resolution state has been corrected.

Known Issues

Title Category Resolution Description Posted Updated
Symlinks to /fs/project directories were missing filesystem Resolved

Symlinks to /fs/project directories were missing for a short period of time on both Tuesday afternoon, October 11 (right after downtime), and Wednesday morning, October 12. It might result in job...

1 year 9 months ago 1 year 9 months ago
Network card re-seat Network Resolved

At 8AM on Tuesday, July 9th 2013, we will be re-seating a network card in a switch at our operations center. It is possible that a brief (~10 minute) outage may occur. Jobs will pause for the...

11 years 3 weeks ago 11 years 2 weeks ago
Rolling reboot of Oakley and Ruby clusters, starting from 8:30AM October 9, 2017 Batch, login, Ruby Resolved

Updates on 1:00PM October 16, 2017: 

The rolling reboots of Oakley and Ruby are completed. 

...
6 years 9 months ago 6 years 9 months ago
(informational) GPFS maintenance work duplicate known issue filesystem Resolved

Maintenance work on the GPFS servers is scheduled to be performed today, 28 Feb 2020 at 2:00 p.m.

Although there is no direct impact expected to services at OSC, there may be short...

4 years 5 months ago 4 years 5 months ago
Oakley login node instability Operations Resolved

Oakley login nodes are seeing some instability related to Lustre. We will reboot the nodes on Thursday, October 2nd 2014 to resolve the issue. If a login node crashes before then and we have the...

9 years 10 months ago 9 years 9 months ago
Rolling reboots of Owens and Pitzer, starting from Tuesday, Jan 22, 2019 Batch, login, Owens Resolved

...

5 years 6 months ago 5 years 6 months ago
Singularity: reached your pull rate limit Owens, Pitzer, Software Resolved (workaround)

You might encounter an error while pulling a large Docker image:

ERROR: toomanyrequests: Too Many Requests.

or

You have reached your pull rate limit. You may...
3 years 1 month ago 2 years 3 months ago
Project space giving errors "No space left on device" filesystem Resolved

11/01/2016 11:52AM Update: This issue has been fixed. 

We have become aware of a problem with the Project storage space that gives errors "No space left on device". The...

7 years 9 months ago 7 years 8 months ago
PyTorch jobs timeout and hanging GPU Resolved

We have observed that many PyTorch users frequently encounter random timeouts, which result in the termination of their jobs but leave the process running on the node....

1 year 2 weeks ago 6 months 3 weeks ago
Brief disruption to external network, 2013/12/29 Connectivity Resolved

Between 5:00AM and 9:00AM EDT on Sunday,...

10 years 7 months ago 10 years 7 months ago
