We are currently experiencing temporary instability on the Ascend login nodes.

A rolling reboot is in progress to address CVE-2026-23111 for all clusters, including Ascend, Cardinal, and Pitzer.

Known issues

Unresolved known issues

A known issue with an Unresolved Resolution state is an active problem under investigation; a temporary workaround may be available.

Resolved known issues

A known issue with a Resolved (workaround) Resolution state is an ongoing problem; a permanent workaround is available, which may include using different software or hardware.

A known issue with a Resolved Resolution state has been corrected.

Known Issues

Title Category Resolution Description Posted Updated
Title: PyTorch hangs on dual-GPU node on Ascend
Category: Ascend, GPU
Resolution: Resolved (workaround)
Description: PyTorch can hang on Ascend on dual-GPU nodes

Through internal testing, we have confirmed that the hang issue only occurs on Ascend dual-GPU (nextgen) nodes. We’re still unsure why...

Posted: 1 year 1 month ago
Updated: 1 year 1 month ago
Title: Ascend desktop including lightweight is not working
Category: none listed
Resolution: Resolved
Description: Update: this is fixed.

Original Post:

Ascend Desktop, including...

Posted: 1 year 2 months ago
Updated: 1 year 2 months ago
Title: OpenMPI 4 and NVHPC MPI Compatibility Issues with SLURM HWLOC
Category: Ascend, Cardinal, Software
Resolution: Resolved (workaround)
Description: A pure MPI application using mpirun or mpiexec with more ranks than the number of NUMA nodes may encounter an error similar to the following:...

Posted: 1 year 3 months ago
Updated: 12 months 4 days ago
Title: Abaqus Parallel Job Failure with PMPI Due to Out-of-Memory (OOM) Error
Category: Cardinal
Resolution: Resolved (workaround)
Description: You may encounter the following error while running an Abaqus parallel job with PMPI:

Traceback (most recent call last):
 File "SMAPylModules/SMAPylDriverPy.m/src/driverAnalysis.py",...

Posted: 1 year 5 months ago
Updated: 1 year 1 month ago
Title: --gpus-per-task is not working
Category: Batch
Resolution: Resolved
Description: Updated: This is fixed.

Original Post:

After the recent Slurm upgrade, the option --gpus-per-task is currently not functioning as...

Posted: 1 year 5 months ago
Updated: 1 year 5 months ago
Title: Core label on OnDemand app is incorrect
Category: OnDemand
Resolution: Resolved
Description: The core label on the OnDemand app incorrectly displays as '1', regardless of the requested number of cores for a job. While this label is incorrect, the job is still allocated the correct number...

Posted: 1 year 5 months ago
Updated: 1 year 4 months ago
Title: Core and Node labels on Classroom app are incorrect
Category: none listed
Resolution: Resolved
Description: The core and node labels on the Classroom app (class.osc.edu) incorrectly display as '0', regardless of the requested number of cores for a job. While this label is incorrect, the job is still...

Posted: 1 year 5 months ago
Updated: 1 year 4 months ago
Title: Ansys OMP: System error #22: Invalid argument
Category: Cardinal
Resolution: Resolved (workaround)
Description: You may encounter the following error while running Ansys on Cardinal:

OMP: Error #100: Fatal system error detected.
OMP: System error #22: Invalid argument
forrtl: error (76): Abort...

Posted: 1 year 5 months ago
Updated: 1 year 1 month ago
Title: LS-DYNA mpp-dyna Cardinal: Remote access error on mlx5_0:1, RDMA_READ
Category: Cardinal, Software
Resolution: Resolved (workaround)
Description: You may encounter the following error while running mpp-dyna jobs with multiple nodes:

[c0054:22206:0:22206] ib_mlx5_log.c:179  Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88...

Posted: 1 year 4 months ago
Updated: 12 months 4 days ago
Title: BWA 0.7.17 vulnerability
Category: Cardinal
Resolution: Resolved (workaround)
Description: Cardinal hosted bwa 0.7.17, a version with an unpatched vulnerability.

This version has been removed from Cardinal in favor of 0.7.18.

Posted: 1 year 3 months ago
Updated: 1 year 1 month ago
