Users may encounter the following message, and NCCL may hang, if the first collective operation is a barrier when running multi-GPU training. We have identified that this issue occurs only on single Ascend Next Gen (dual-GPU) nodes where the GPUs are connected across NUMA nodes via the SMP interconnect rather than via NVLink.
Workaround
As a workaround, set the environment variable NCCL_P2P_DISABLE=1 to disable peer-to-peer GPU communication over NVLink or PCIe. We are still investigating why this setting is required even for single-node setups, but the performance impact appears to be minimal.
This minimal impact is expected because, on these dual-GPU nodes, the GPUs communicate through the SMP interconnect and shared memory across NUMA nodes, not via NVLink.
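As a concrete illustration of the workaround, the variable can be exported in the shell before launching training. The launcher and script names below (torchrun, train.py) are placeholders for your own setup, not part of the reported issue:

```shell
# Disable NCCL peer-to-peer (NVLink/PCIe) transport; NCCL falls back to
# shared-memory (SHM) transport, which these dual-GPU nodes use anyway.
export NCCL_P2P_DISABLE=1

# Example launch (placeholder launcher and script):
# torchrun --nproc_per_node=2 train.py
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```

The variable only needs to be visible in the environment of the training processes, so it can equally be set inline (e.g. `NCCL_P2P_DISABLE=1 torchrun ...`) or in a job script.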