Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus | NVIDIA Technical Blog

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down, it becomes challenging to determine why and what to do next. A problem can span computation, communication, a specific rank, or underlying hardware.

NVIDIA NCCL Inspector accelerates triaging by providing a lightweight and continuous report of NCCL communication performance. It tracks operation type, size, and bandwidth across every rank, and with this latest enhancement, can facilitate real-time analysis with minimal overhead.
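To make the bandwidth metric concrete, the sketch below shows how bandwidth for a collective is conventionally derived from message size and elapsed time, using the algorithm-bandwidth and bus-bandwidth convention documented for the nccl-tests benchmarks. Whether NCCL Inspector reports exactly these quantities is an assumption here, and the function names are hypothetical rather than part of any NCCL API.

```python
def algorithm_bandwidth_gb_per_s(message_bytes: int, elapsed_seconds: float) -> float:
    """Algorithm bandwidth: bytes processed per second, in GB/s (1 GB = 1e9 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return message_bytes / elapsed_seconds / 1e9


def allreduce_bus_bandwidth_gb_per_s(
    message_bytes: int, elapsed_seconds: float, num_ranks: int
) -> float:
    """Bus bandwidth for AllReduce, following the nccl-tests convention.

    The algorithm bandwidth is scaled by 2 * (n - 1) / n so that the value
    reflects per-link traffic and stays comparable across rank counts.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    algbw = algorithm_bandwidth_gb_per_s(message_bytes, elapsed_seconds)
    return algbw * 2 * (num_ranks - 1) / num_ranks


# Illustrative numbers only: a 1 GB AllReduce that took 10 ms on 8 ranks.
algbw = algorithm_bandwidth_gb_per_s(1_000_000_000, 0.01)
busbw = allreduce_bus_bandwidth_gb_per_s(1_000_000_000, 0.01, 8)
print(f"algbw={algbw:.1f} GB/s, busbw={busbw:.1f} GB/s")
```

Comparing bus bandwidth rather than raw algorithm bandwidth makes it easier to judge a collective against the hardware's peak link speed, regardless of how many ranks participate.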

NCCL Inspector also helps determine the optimal training recipe. A previous post introduced NCCL Inspector offline mode. While that fine-grained analysis remains the standard for deep dives, this post introduces a new feature: real-time monitoring. Integrating NCCL Inspector with Prometheus Exporter makes it possible to power live, time-series visualizations directly within a user’s infrastructure dashboard.
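As a rough illustration of what a Prometheus-style integration exposes, the sketch below renders per-rank samples in the Prometheus text exposition format, which a Prometheus server scrapes over HTTP. The metric name, label names, and values are hypothetical and are not taken from NCCL Inspector; the actual metrics it exports may differ.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: list) -> str:
    """Render one gauge metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and labels, for illustration only.
text = render_gauge(
    "nccl_collective_bus_bandwidth_gb_per_s",
    "Bus bandwidth of the most recent collective, per rank.",
    [
        ({"rank": 0, "collective": "AllReduce"}, 175.0),
        ({"rank": 1, "collective": "AllReduce"}, 172.5),
    ],
)
print(text)
```

Labeling each sample by rank and collective type lets a dashboard plot one time series per rank, which is what makes a single slow rank stand out.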

NCCL Inspector deployment architecture

NCCL 2.30 introduces Prometheus Mode, a major enhancement for real-time performance monitoring of NCCL in AI workloads. NCCL Inspector works in two modes, shown in Figures 1 and 2.
