Temporary Login Node Instability on Ascend
We are currently experiencing temporary instability on the Ascend login nodes.
Since the operating system was upgraded to RHEL 9.6 during the scheduled downtime on May 12, 2026, applications that use UCX (rc_gda) or GPU-initiated networking may experience failures.
The AUSURF112 benchmark aborts with quantum-espresso/7.4.1 on Ascend. We suspect a latent bug in Quantum ESPRESSO and are reporting it as a convenience. Concerned users can run on Cardinal or Pitzer as a workaround.
Update on November 24, 2025: The issue is resolved with MVAPICH 4 variants.
On all clusters, the cp2k executables from module cp2k/2023.2 can produce very large output files because of many repeated error messages from MKL, e.g.:
MKL module files define some helper environment variables with incorrect paths, which can cause link-time errors. All three clusters are affected. We are working to correct the module files. As a workaround, users can redefine the affected environment variable with the correct path; this requires some familiarity with the build environment, so we recommend that users contact oschelp@osc.edu for assistance. An example error from Cardinal with module intel-oneapi-mkl/2023.2.0, which defines the environment variable MKL_LIBS_INT64, follows:
When running multi-GPU training, users may encounter the following message and experience NCCL hangs if the first operation is a barrier. We have identified that this issue occurs only on a single Ascend Next Gen (dual-GPU) node, where the GPUs are connected via the SMP interconnect across NUMA nodes rather than through NVLink.
PyTorch can hang on Ascend on dual-GPU nodes
Through internal testing, we have confirmed that the hang occurs only on Ascend dual-GPU (nextgen) nodes. We’re still unsure why setting NCCL_P2P_DISABLE is necessary even for single-node setups. However, the performance impact should be minimal, because on dual-GPU nodes the GPUs are connected via the SMP interconnect across NUMA nodes. As there’s no NVLink on these nodes, GPU communication occurs through shared memory.
To work around this, set the environment variable NCCL_P2P_DISABLE=1.
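For users who prefer to apply the workaround inside the training script rather than in the job script, the following is a minimal sketch. It assumes the variable only needs to be present in each process's environment before NCCL initializes (for example, before the process group is created); the helper name disable_nccl_p2p is hypothetical and used only for illustration.

```python
import os


def disable_nccl_p2p(env=os.environ):
    """Set NCCL_P2P_DISABLE=1 in the given environment mapping and return it."""
    env["NCCL_P2P_DISABLE"] = "1"
    return env


# Call this at the very top of the training script, before importing or
# initializing anything that starts NCCL (assumption: NCCL reads the
# variable when it initializes, so it must be set before that point).
disable_nccl_p2p()
print("NCCL_P2P_DISABLE =", os.environ["NCCL_P2P_DISABLE"])
```

Setting the variable in the job script before launching the application has the same effect and also covers programs whose source cannot be modified.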
A pure MPI application using mpirun or mpiexec with more ranks than the number of NUMA nodes may encounter an error similar to the following:
When running a full-node MPI job with MVAPICH 3.0, you may encounter the following warning message:
When running MPI+OpenMP hybrid code with the Intel Classic Compiler and MVAPICH 3.0, you may encounter the following warning message from hwloc: