Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling | NVIDIA Technical Blog

The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, featuring NVIDIA Blackwell architecture, are rack-scale supercomputers. They’re designed with 18 tightly coupled compute trays, massive GPU fabrics, and high-bandwidth networking packaged as a unit.

For AI architects and HPC platform operators, the challenge isn’t just racking and stacking hardware—it’s turning infrastructure into safe, performant, and easy-to-use resources for end users. The mismatch between rack-scale hardware topology and scheduler abstractions is where most of the operational complexity lives. Left unaddressed, schedulers operate on a flat pool of GPUs and nodes, overlooking the system’s hierarchical and topology-sensitive design.
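To make the flat-pool problem concrete, here is a minimal Python sketch, not the logic of Slurm, NVIDIA Run:ai, or Mission Control. It assumes a hypothetical inventory in which each of the 18 compute trays in a rack is one schedulable node with 4 GPUs, and each rack is a single NVLink domain. The names `build_inventory`, `flat_placement`, `domain_aware_placement`, and `domains_spanned` are illustrative only.

```python
from collections import defaultdict

# Hypothetical inventory entries: (node_name, nvlink_domain, gpus).
# Assumption: 18 compute trays per rack, 4 GPUs per tray, one NVLink domain per rack.
def build_inventory(racks=2, trays_per_rack=18, gpus_per_tray=4):
    return [
        (f"rack{r}-tray{t:02d}", f"rack{r}", gpus_per_tray)
        for r in range(racks)
        for t in range(trays_per_rack)
    ]

def flat_placement(free_nodes, num_nodes):
    """Topology-blind: take the first free nodes in list order."""
    return [name for name, _, _ in free_nodes[:num_nodes]]

def domain_aware_placement(free_nodes, num_nodes):
    """Place the whole job inside one NVLink domain, or return None."""
    by_domain = defaultdict(list)
    for name, domain, _ in free_nodes:
        by_domain[domain].append(name)
    # Best fit: the domain with the fewest free nodes that still holds the job.
    candidates = [names for names in by_domain.values() if len(names) >= num_nodes]
    if not candidates:
        return None
    return min(candidates, key=len)[:num_nodes]

def domains_spanned(placement):
    """Set of NVLink domains (rack prefixes) touched by a placement."""
    return {name.split("-")[0] for name in placement}

# Scenario: 14 trays in rack0 are busy, rack1 is idle, and a job needs 8 nodes.
free = build_inventory()[14:]
flat = flat_placement(free, 8)
aware = domain_aware_placement(free, 8)
print(sorted(domains_spanned(flat)), sorted(domains_spanned(aware)))
```

In this scenario the topology-blind placement splits the job across two racks, while the domain-aware placement keeps all eight nodes inside a single NVLink domain. A production scheduler weighs many more factors (fragmentation, priorities, IMEX domain setup), so treat this only as an illustration of why the hierarchy matters.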

This is the gap that a validated software stack, such as NVIDIA Mission Control, is designed to bridge. Mission Control provides rack-scale control planes for NVIDIA Grace Blackwell NVL72 systems. With a native understanding of NVIDIA NVLink and NVIDIA IMEX domains, it integrates with workload management platforms like Slurm and NVIDIA Run:ai. These capabilities will also be supported for the NVIDIA Vera Rubin platform, including for NVIDIA Rubin NVL8.

This post demonstrates how Mission Control, Slurm, and NVIDIA Run:ai turn advanced GPU architecture concepts—such as NVLink and IMEX domains—into an operational AI factory that is scalable, schedulable, and easy to manage.
