NCCL

The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides collective routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, all optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes.
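To illustrate the collective API described above, the sketch below performs a sum all-reduce across two GPUs in a single process. It is a minimal, untested sketch rather than a production example: error checking and buffer initialization are omitted, and it assumes a node with at least two GPUs plus the CUDA and NCCL development headers.

```c
/* Minimal single-process, multi-GPU all-reduce sketch using the NCCL API.
 * Illustrative only: error checking and data initialization are omitted.
 * Requires CUDA, NCCL, and at least two GPUs. */
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    const int nDev = 2;
    const size_t count = 1 << 20;   /* elements reduced per rank */
    int devs[2] = {0, 1};
    ncclComm_t comms[2];
    cudaStream_t streams[2];
    float *sendbuff[2], *recvbuff[2];

    /* Allocate device buffers and one stream per GPU */
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc((void **)&sendbuff[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuff[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* One communicator per GPU, all owned by this process */
    ncclCommInitAll(comms, nDev, devs);

    /* Group the calls: required when one thread drives several
     * communicators, otherwise the collectives can deadlock */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    /* Wait for completion, then clean up */
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(sendbuff[i]);
        cudaFree(recvbuff[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

Multi-node runs use the same collective calls but initialize communicators with ncclGetUniqueId and ncclCommInitRank, typically with one MPI rank per GPU.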

Availability and Restrictions

Versions

NCCL is available on OSC clusters. The versions currently available at OSC are:

Version     Pitzer    Ascend    Cardinal
2.19.3-1    X         X         X*

* Current default version

You can use module spider nccl to view available modules for a given machine. Feel free to contact OSC Help if you need other versions for your work.
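For example, a typical session on a login node might look like the following (a sketch of standard Lmod usage; the exact module names and versions shown by `module spider` vary by cluster):

```shell
# List the NCCL modules available on the current cluster
module spider nccl

# Load the default NCCL module; a specific version can be
# requested instead, e.g. "module load nccl/2.19.3-1"
module load nccl
```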

Access

NCCL is available to all OSC users. If you have any questions, please contact OSC Help.

Publisher/Vendor/Repository and License Type

NVIDIA; see the NVIDIA license documents below for licensing details.

SLA
This document is the Software License Agreement (SLA) for NVIDIA NCCL. The following contains specific license terms and conditions for NVIDIA NCCL. By accepting this agreement, you agree to comply with all the terms and conditions applicable to the specific product(s) included herein.
 
BSD License
This document is the Berkeley Software Distribution (BSD) license for NVIDIA NCCL. The following contains specific license terms and conditions for the open-source release of NVIDIA NCCL. By accepting this agreement, you agree to comply with all the terms and conditions applicable to the specific product(s) included herein.

Usage

Performance

The performance results were obtained by running the NVIDIA NCCL Tests. The tests were built with NCCL 2.19.3, CUDA 12, and Open MPI 5. Each performance value is the average of five runs using a 512 MB message size. The total number of ranks for each test was configured as follows:

  • Single-node Allreduce: -g $SLURM_GPUS_PER_NODE -t 1
  • Single-node SendRecv: -g 2 -t 1
  • Node-to-node: srun -N 2 --ntasks-per-node=1 with -g 1 -t 1

Note: For Ascend dual-GPU nodes, the environment variable NCCL_P2P_DISABLE was set to 1 due to a known issue.
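As a concrete illustration of the node-to-node configuration above, the benchmark could be submitted with a Slurm batch script along the following lines. This is a sketch: the module names, the path to the all_reduce_perf binary from the NCCL Tests suite, and any account/partition directives are assumptions to adapt for your environment.

```shell
#!/bin/bash
# Hypothetical batch sketch for the node-to-node Allreduce test:
# two nodes, one rank per node, one GPU per rank.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1

# Module names/versions are placeholders; adjust per cluster
module load nccl cuda openmpi

# 512 MB message size, one GPU (-g 1) and one thread (-t 1) per rank
srun -N 2 --ntasks-per-node=1 ./all_reduce_perf -b 512M -e 512M -g 1 -t 1
```

On the Ascend dual-GPU nodes, NCCL_P2P_DISABLE=1 would additionally be exported before the srun line, per the note above.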

Cluster          Single node              Node to node
                 SendRecv    Allreduce    SendRecv    Allreduce
Cardinal         124 GB/s    240 GB/s     28.8 GB/s   46.7 GB/s
Ascend (quad)    72 GB/s     144 GB/s     6.3 GB/s    6.3 GB/s
Ascend (dual)    11.8 GB/s   12.0 GB/s    9.5 GB/s    9.5 GB/s
Pitzer           8.5 GB/s    7.3 GB/s     5.3 GB/s    8.8 GB/s

Known Issues


Known Issues for NCCL