The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, all optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes.
NCCL is available on OSC clusters. The versions currently available at OSC are:
| Version | Pitzer | Ascend | Cardinal |
|---|---|---|---|
| 2.19.3-1 | X | X | X* |
* Current default version
You can use `module spider nccl` to view available modules for a given machine. Feel free to contact OSC Help if you need other versions for your work.
NCCL is available to all OSC users. If you have any questions, please contact OSC Help.
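To use NCCL in a job, load the module in your batch script. Below is a minimal sketch, assuming the module is named `nccl` and that a matching `cuda` module is available; verify exact names and versions with `module spider nccl`. The source file `my_nccl_app.cu` is a hypothetical placeholder.

```bash
#!/bin/bash
# Minimal sketch of an OSC batch job using NCCL.
# Module names and resource requests are assumptions; verify with
# `module spider nccl` and your project's allocation details.
#SBATCH --job-name=nccl-example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=2
#SBATCH --time=00:10:00

module load cuda   # NCCL links against a matching CUDA toolkit
module load nccl   # assumed module name; a version suffix may be required

# Compile a CUDA program against NCCL (my_nccl_app.cu is a placeholder);
# the module typically exports the needed include/library paths.
nvcc -o my_nccl_app my_nccl_app.cu -lnccl

srun ./my_nccl_app
```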
NCCL is published by NVIDIA; see NVIDIA's NCCL documentation for licensing details.
The performance results below were obtained by running the NVIDIA NCCL Tests. The tests were built with NCCL 2.19.3, CUDA 12, and Open MPI 5. Each performance value is the average of five runs using a 512 MB message size. The total number of ranks for each test was configured as follows (example invocations are sketched after this list):

- Single node: `-g $SLURM_GPUS_PER_NODE -t 1`
- Single node, dual-GPU: `-g 2 -t 1`
- Node to node: `srun -N 2 --ntasks-per-node=1` with `-g 1 -t 1`

Note: For Ascend dual-GPU nodes, the environment variable `NCCL_P2P_DISABLE` was set to 1 due to a known issue.
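A minimal sketch of how such benchmarks can be built and run, assuming the NCCL Tests repository (https://github.com/NVIDIA/nccl-tests) and site module names `cuda`, `nccl`, and `openmpi`; the `-b`/`-e` (min/max message size), `-g` (GPUs per thread), and `-t` (threads per process) flags are standard nccl-tests options, while the `*_HOME` paths are assumptions to adjust for your environment:

```bash
# Sketch: build and run NCCL Tests (https://github.com/NVIDIA/nccl-tests).
# Module names and *_HOME paths are assumptions; adjust for your site.
module load cuda nccl openmpi
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=$MPI_HOME CUDA_HOME=$CUDA_HOME NCCL_HOME=$NCCL_HOME

# Single node: one thread driving all GPUs on the node, 512 MB messages.
./build/all_reduce_perf -b 512M -e 512M -g $SLURM_GPUS_PER_NODE -t 1
./build/sendrecv_perf   -b 512M -e 512M -g $SLURM_GPUS_PER_NODE -t 1

# Node to node: one rank per node, one GPU per rank.
srun -N 2 --ntasks-per-node=1 ./build/all_reduce_perf -b 512M -e 512M -g 1 -t 1

# On Ascend dual-GPU nodes, disable peer-to-peer transport first (per the
# note above):
# export NCCL_P2P_DISABLE=1
```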
| Cluster | Single Node SendRecv | Single Node Allreduce | Node to Node SendRecv | Node to Node Allreduce |
|---|---|---|---|---|
| Cardinal | 124 GB/s | 240 GB/s | 28.8 GB/s | 46.7 GB/s |
| Ascend (quad) | 72 GB/s | 144 GB/s | 6.3 GB/s | 6.3 GB/s |
| Ascend (dual) | 11.8 GB/s | 12.0 GB/s | 9.5 GB/s | 9.5 GB/s |
| Pitzer | 8.5 GB/s | 7.3 GB/s | 5.3 GB/s | 8.8 GB/s |