NVIDIA Collective Communications Library (NCCL)

Multi-GPU and multi-node collective communication primitives

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that are optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnects.
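
As an illustration, the following is a minimal sketch (not taken from NVIDIA's documentation) of a single-process application driving an all-reduce across every visible GPU: ncclCommInitAll creates one communicator per device, and the per-device ncclAllReduce calls are grouped so they are submitted as a single collective. Error handling is reduced to simple aborts for brevity.

    /* Minimal single-process, multi-GPU all-reduce sketch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <nccl.h>

    #define CUDACHECK(cmd) do { cudaError_t e = (cmd); \
      if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)
    #define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); \
      if (r != ncclSuccess) { printf("NCCL error: %s\n", ncclGetErrorString(r)); exit(1); } } while (0)

    int main(void) {
      int ndev = 0;
      CUDACHECK(cudaGetDeviceCount(&ndev));
      if (ndev > 8) ndev = 8;  /* keep the fixed-size arrays below in bounds */

      ncclComm_t comms[8];
      float* sendbuff[8];
      float* recvbuff[8];
      cudaStream_t streams[8];
      const size_t count = 1 << 20;  /* 1M floats per GPU; contents left uninitialized for brevity */

      /* Allocate one send/recv buffer and one stream per GPU. */
      for (int i = 0; i < ndev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        CUDACHECK(cudaMalloc((void**)&sendbuff[i], count * sizeof(float)));
        CUDACHECK(cudaMalloc((void**)&recvbuff[i], count * sizeof(float)));
        CUDACHECK(cudaStreamCreate(&streams[i]));
      }

      /* One communicator per GPU, created by a single thread. */
      NCCLCHECK(ncclCommInitAll(comms, ndev, NULL));

      /* Sum every GPU's send buffer into every GPU's recv buffer. */
      NCCLCHECK(ncclGroupStart());
      for (int i = 0; i < ndev; ++i)
        NCCLCHECK(ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat,
                                ncclSum, comms[i], streams[i]));
      NCCLCHECK(ncclGroupEnd());

      /* The collective is asynchronous; wait for it on every stream. */
      for (int i = 0; i < ndev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        CUDACHECK(cudaStreamSynchronize(streams[i]));
      }

      for (int i = 0; i < ndev; ++i) NCCLCHECK(ncclCommDestroy(comms[i]));
      return 0;
    }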

Developers of deep learning frameworks can rely on NCCL's highly optimized, MPI-compatible, and topology-aware routines to take full advantage of all available GPUs within and across multiple nodes. Leading deep learning frameworks such as Caffe, Caffe2, Chainer, MXNet, TensorFlow, and PyTorch have integrated NCCL to accelerate deep learning training on multi-GPU systems.
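
For multi-process and multi-node jobs, NCCL is typically paired with a launcher such as MPI. The sketch below, assuming one process per GPU, shows the common pattern: rank 0 creates an ncclUniqueId, MPI broadcasts it, and every process joins the same communicator with ncclCommInitRank. Error checking is omitted to keep the sketch short.

    /* Hedged sketch of the one-communicator-per-process pattern used alongside MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main(int argc, char* argv[]) {
      MPI_Init(&argc, &argv);

      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      /* Pick a GPU for this process (assumes one process per GPU per node). */
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      cudaSetDevice(rank % ndev);

      /* Rank 0 generates the NCCL unique id; MPI distributes it to all ranks. */
      ncclUniqueId id;
      if (rank == 0) ncclGetUniqueId(&id);
      MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

      /* Every process joins the same communicator under its own rank. */
      ncclComm_t comm;
      ncclCommInitRank(&comm, nranks, id, rank);

      /* ... issue collectives such as ncclAllReduce on a CUDA stream here ... */

      ncclCommDestroy(comm);
      MPI_Finalize();
      return 0;
    }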

We strive to bring the best experience to the developer community; as a result, we have made NCCL 2.3 and later open source. This enables us to have open discussions with the developer community as we continue to build a great product. The source code for NCCL is available on GitHub, and NCCL binaries can be downloaded from the NVIDIA Developer Zone.

What’s New in NCCL 2.5

NCCL 2.5 highlights include:

  • Improved efficiency at an even larger scale than before (tens of thousands of GPUs)
  • Improved topology detection and tree/ring creation
  • Model-based tuning to switch between the different algorithms and protocols

Read the latest NCCL release notes for a detailed list of new features and enhancements.

Figure: NCCL AllReduce bus bandwidth (BusBW) benchmarks. Transformer network, batch size 640, overlap 0.20, on 32x DGX-1V with 4x Mellanox CX-6; GNMT network, batch size 32, overlap 0.15, on 24x DGX-1V with 4x Mellanox CX-6.

Key Features

  • Support for multi-threaded and multi-process applications
  • Faster training of newer and deeper models with aggregated inter-GPU reduction operations (see the sketch after this list)
  • Multiple ring formations for high bus utilization
  • Tree algorithm implementation that reduces latency for large-scale multi-GPU and multi-node training
  • Support for InfiniBand verbs, libfabric, RoCE, and IP socket internode communication
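
As a rough sketch of the aggregated reduction feature mentioned above, the function below fuses several per-bucket all-reduce calls between ncclGroupStart and ncclGroupEnd so NCCL can launch them together rather than one kernel per tensor. The names grads, counts, nbuckets, comm, and stream are hypothetical placeholders for an application's own gradient buffers and state.

    /* Hedged sketch: aggregate several gradient reductions into one group call. */
    #include <cuda_runtime.h>
    #include <nccl.h>

    void allreduce_gradients(float** grads, const size_t* counts, int nbuckets,
                             ncclComm_t comm, cudaStream_t stream) {
      ncclGroupStart();
      for (int i = 0; i < nbuckets; ++i) {
        /* In-place sum across all ranks for each gradient bucket. */
        ncclAllReduce(grads[i], grads[i], counts[i], ncclFloat, ncclSum,
                      comm, stream);
      }
      ncclGroupEnd();
      /* The calls are asynchronous; synchronize the stream before reading results. */
    }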

Additional Resources