We are currently experiencing temporary instability on the Ascend login nodes.

A rolling reboot of all clusters, including Ascend, Cardinal, and Pitzer, is in progress to address CVE-2026-23111.

Known issues

Unresolved known issues

A known issue with an Unresolved Resolution state is an active problem under investigation; a temporary workaround may be available.

Resolved known issues

A known issue with a Resolved (workaround) Resolution state is an ongoing problem for which a permanent workaround is available; the workaround may involve using different software or hardware.

A known issue with a Resolved Resolution state has been corrected.

Known Issues

Title, Category, Resolution, Description, Posted, Updated
Problems with Project Space (/nfs/gpfs)
Category: filesystem; Resolution: Resolved

(9/8/15 14:21 Eastern) Project space appears to be back to normal operation. We are running some tests to verify that the problem is fully resolved.


As of early afternoon, Sept. 8,...

Posted: 10 years 9 months ago; Updated: 10 years 9 months ago
Segmentation fault from openmpi/1.10-hpcx and 2.0-hpcx on Owens
Category: Owens, Software; Resolution: Resolved

We have found that recent MPI jobs using openmpi/1.10-hpcx and openmpi/2.0-hpcx on Owens may complete or hang until the job is killed, but in either case they receive a segmentation fault. Some applications might be ...

Posted: 6 years 10 months ago; Updated: 6 years 10 months ago
Correction on OSC project restriction email
Category: Account Management; Resolution: Resolved

Updated:

OSC has resolved this morning's issue and reverted impacted projects back to an ACTIVE status. Queued jobs under those projects will be able to start once today's...

Posted: 4 years 11 months ago; Updated: 4 years 11 months ago
OnDemand GPU request error (Nov 2021)
Category: Batch, OnDemand, Pitzer; Resolution: Resolved

When requesting an interactive session in OnDemand with GPU resources, users may see an error similar to "sbatch: error: Invalid generic resource (gres) specification".

...

Posted: 4 years 6 months ago; Updated: 4 years 6 months ago
Poor network performance on some filesystems
Category: filesystem; Resolution: Resolved

We are experiencing some network performance issues on a cluster of servers that provide GPFS and some project filesystems. GPFS appears to be functioning acceptably, but proj01, proj02...

Posted: 12 years 10 months ago; Updated: 12 years 10 months ago
MPI fails with UCX 1.18
Category: Software; Resolution: Resolved (workaround)

After the downtime on August 19, 2025, users may encounter UCX errors such as:

UCX ERROR no active messages transport to <no debug data>: self/memory -...
Posted: 9 months 3 weeks ago; Updated: 8 months 1 week ago
Rolling reboot of Owens cluster, starting from 8:30AM Oct 30, 2017
Category: Batch, Owens; Resolution: Resolved

Updated on November 21, 2017 at 3:33 PM:

The rolling reboot has been completed.

Updated on October 20, 2017 at 4:19 PM:

We will have a rolling reboot of Owens...

Posted: 8 years 7 months ago; Updated: 8 years 6 months ago
MOE license server down
Category: Licensing; Resolution: Resolved

The MOE license server is experiencing an unknown issue and may be down. We are working to resolve the issue.

Posted: 2 years 8 months ago; Updated: 2 years 8 months ago
Scheduling suspended
Category: Batch; Resolution: Resolved

We have temporarily suspended scheduling due to some problems with the parallel scratch file system.

Posted: 11 years 8 months ago; Updated: 11 years 8 months ago
vLLM versions prior to 0.14.1 are deprecated
Category: Software; Resolution: Resolved

vLLM versions prior to 0.14.1 are deprecated due to security issue CVE-2026-22778, which can allow remote code execution. Clients are advised to use versions 0....

Posted: 4 months 2 hours ago; Updated: 4 months 2 hours ago
