Ascend

PyTorch hangs on dual-GPU nodes on Ascend

Through internal testing, we have confirmed that PyTorch can hang on Ascend and that the issue only occurs on the dual-GPU (nextgen) nodes. We are still unsure why setting NCCL_P2P_DISABLE is necessary even for single-node jobs, but the performance impact should be minimal: on the dual-GPU nodes the GPUs sit on separate NUMA nodes and are connected over the SMP interconnect rather than NVLink, so GPU-to-GPU communication already goes through shared memory.
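
If you want to confirm that there is no direct peer-to-peer path between the GPUs on a node, a quick check from Python is possible. This is a minimal sketch; it only assumes PyTorch with CUDA support is available on the node.

```python
import torch

# Minimal sketch: report whether CUDA peer-to-peer access is possible
# between each pair of visible GPUs. On nodes without NVLink or a direct
# PCIe peer path, this is typically not possible, which is why disabling
# NCCL P2P has little performance impact there.
def report_p2p_access():
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                status = "possible" if ok else "not possible"
                print(f"GPU {i} -> GPU {j}: peer access {status}")

if __name__ == "__main__":
    report_p2p_access()
```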

Workaround

To work around this, set the environment variable NCCL_P2P_DISABLE=1 before launching the job.
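
If you prefer to set the variable from Python rather than in the job script, the sketch below sets it before the distributed process group is created. The torchrun launch and the training loop are assumptions for illustration, not part of the original report.

```python
import os

# NCCL reads NCCL_P2P_DISABLE when the communicator is created, so the
# variable must be set before init_process_group() is called (or exported
# in the job script before launch).
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch
import torch.distributed as dist

def main():
    # Assumes the job is launched with torchrun, which sets RANK,
    # WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # ... build the model, wrap it in DistributedDataParallel, train ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Exporting the variable in the batch script before launch works just as well, since child processes inherit the environment.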