After the operating system upgrade to RHEL 9.6 during the scheduled downtime on May 12, 2026, applications that use UCX (rc_gda) or GPU-initiated networking may fail.
The failures occur because CUDA cannot register the mlx5 DevX UAR (User Access Region) doorbell page, a step required to make that page visible to CUDA device code.
The failing operation typically looks like this:
cuMemHostRegister(uar->reg_addr, 256, CU_MEMHOSTREGISTER_PORTABLE | CU_MEMHOSTREGISTER_DEVICEMAP | CU_MEMHOSTREGISTER_IOMEMORY)
This call returns CUDA_ERROR_INVALID_VALUE.
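When triaging logs, the flag argument and the return status of this call often show up as raw integers. The following is a minimal sketch of how they decode, assuming the numeric values published in CUDA's cuda.h; the class and helper names (HostRegisterFlag, describe_flags, describe_result) are hypothetical and are not part of CUDA or UCX.

```python
from enum import IntFlag


class HostRegisterFlag(IntFlag):
    """Bit values of the cuMemHostRegister flags used in the failing call."""
    PORTABLE = 0x01   # CU_MEMHOSTREGISTER_PORTABLE
    DEVICEMAP = 0x02  # CU_MEMHOSTREGISTER_DEVICEMAP
    IOMEMORY = 0x04   # CU_MEMHOSTREGISTER_IOMEMORY


# Only the two CUresult codes relevant to this issue are listed here.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    1: "CUDA_ERROR_INVALID_VALUE",
}

# The mask passed when registering the UAR doorbell page.
UAR_FLAGS = (
    HostRegisterFlag.PORTABLE
    | HostRegisterFlag.DEVICEMAP
    | HostRegisterFlag.IOMEMORY
)


def describe_flags(mask: int) -> list[str]:
    """Return the CUDA names of the flag bits set in mask."""
    return [
        f"CU_MEMHOSTREGISTER_{flag.name}"
        for flag in HostRegisterFlag
        if mask & flag
    ]


def describe_result(code: int) -> str:
    """Return the CUresult name for code, or a placeholder if not listed."""
    return CU_RESULT_NAMES.get(code, f"unlisted CUresult {code}")


if __name__ == "__main__":
    print(int(UAR_FLAGS), describe_flags(UAR_FLAGS))
    print(describe_result(1))
```

The CU_MEMHOSTREGISTER_IOMEMORY bit is the one that marks the address as memory-mapped I/O rather than ordinary host memory, which is the case the kernel lookup described below has to handle.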
In newer enterprise Linux kernels, the kernel's internal memory lookup function follow_pfn() has been deprecated or removed, and its functional replacements are exported as GPL-only symbols. Because the proprietary NVIDIA driver cannot use GPL-only symbols, it cannot perform the MMIO/PFNMAP memory lookups needed to map the NIC's doorbell page into the GPU's address space, so the registration fails.
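OSC plans to switch from the proprietary NVIDIA driver to the NVIDIA open-source driver on both the Cardinal and Ascend clusters. The open-source kernel modules are dual-licensed MIT/GPL and can therefore use GPL-only kernel symbols. The open-source driver is currently being tested, and we will provide an update once it is ready to roll out.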
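To see which driver flavor a given node has loaded, one option is to inspect /proc/driver/nvidia/version. The sketch below assumes that the open-source kernel modules identify themselves with the phrase "Open Kernel Module" in that file, which matches what NVIDIA's open modules typically report but should be confirmed on the node itself; the function names are hypothetical.

```python
from pathlib import Path

# File exposed by the NVIDIA kernel module once it is loaded.
VERSION_FILE = Path("/proc/driver/nvidia/version")


def classify_driver(version_text: str) -> str:
    """Classify the driver flavor from the version file's contents.

    Assumption: open kernel modules include "Open Kernel Module" in the
    NVRM version line, and the proprietary modules do not.
    """
    if not version_text.strip():
        return "not loaded"
    if "Open Kernel Module" in version_text:
        return "open"
    return "proprietary"


def read_version_text(path: Path = VERSION_FILE) -> str:
    """Return the version file's contents, or "" if it cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    print(classify_driver(read_version_text()))
```

Run on a GPU node, the script prints "open", "proprietary", or "not loaded". A result of "proprietary" on RHEL 9.6 means the node is still subject to the registration failure described above.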