PyTorch can hang on dual-GPU nodes on Ascend
Through internal testing, we have confirmed that the hang only occurs on the Ascend dual-GPU (nextgen) nodes. We are still unsure why setting NCCL_P2P_DISABLE is necessary even for single-node jobs. However, the performance impact should be minimal: on the dual-GPU nodes the two GPUs sit on different NUMA nodes and are connected via the SMP interconnect rather than NVLink, so GPU-to-GPU communication already goes through shared memory.
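At the shell level, nvidia-smi topo -m shows how the two GPUs are connected. From Python, one way to inspect what CUDA reports is shown in the sketch below; it only uses standard torch.cuda queries, and the output will depend on the node it runs on.

```python
# Minimal sketch: inspect what CUDA reports about the GPUs on a node.
# Assumes a node with at least two visible GPUs; output varies by topology.
import torch

def describe_gpus():
    n = torch.cuda.device_count()
    print(f"visible GPUs: {n}")
    for i in range(n):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
    if n >= 2:
        # Whether CUDA allows direct peer (P2P) access between device 0 and 1
        # on this node; this is the path that NCCL_P2P_DISABLE turns off.
        print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
        print("peer access 1 -> 0:", torch.cuda.can_device_access_peer(1, 0))

if __name__ == "__main__":
    describe_gpus()
```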
Workaround
To work around this, set the environment variable NCCL_P2P_DISABLE=1.
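The variable has to be in the environment before NCCL creates its communicators, so either export it in the job script (export NCCL_P2P_DISABLE=1) before launching the job, or set it at the very top of the training script. Below is a minimal sketch of the latter, assuming the script is launched with torchrun so that RANK, WORLD_SIZE, and LOCAL_RANK are already set; the all-reduce at the end is just a sanity check that communication no longer hangs, not part of the workaround itself.

```python
# Minimal sketch: set NCCL_P2P_DISABLE before any NCCL communicator is created,
# i.e. before init_process_group() runs.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"  # force NCCL to avoid P2P transports

import torch
import torch.distributed as dist

def main():
    # Assumes the usual torchrun-provided environment (RANK, WORLD_SIZE, LOCAL_RANK).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Simple all-reduce to confirm GPU communication works without hanging.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as usual, e.g. torchrun --nproc_per_node=2 train.py on one of the dual-GPU nodes.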