Users may encounter the following message, and NCCL may hang, if the first collective operation is a barrier when running multi-GPU training. We have identified that this issue occurs only on single Ascend Next Gen (dual-GPU) nodes where the GPUs are connected across NUMA nodes via the SMP interconnect rather than via NVLink.
Workaround
As a workaround, set the environment variable NCCL_P2P_DISABLE=1 to disable peer-to-peer GPU communication over NVLink or PCIe. We are still investigating why this setting is required even for single-node setups, but the performance impact appears to be minimal.
This minimal impact is expected because, on these dual-GPU nodes, the GPUs communicate through the SMP interconnect and shared memory across NUMA nodes, not via NVLink.
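As a concrete illustration of the workaround, the variable can be exported in the shell before launching training. The launcher and script names below (torchrun, train.py) are placeholders for your own setup, not part of the reported issue:

```shell
# Disable NCCL peer-to-peer (NVLink/PCIe) transport; NCCL falls back to
# shared-memory (SHM) transport, which these dual-GPU nodes use anyway.
export NCCL_P2P_DISABLE=1

# Example launch (placeholder launcher and script):
# torchrun --nproc_per_node=2 train.py
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```

The variable only needs to be visible in the environment of the training processes, so it can equally be set inline (e.g. `NCCL_P2P_DISABLE=1 torchrun ...`) or in a job script.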