Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Most organizations running large-scale AI training have years of investment in Slurm job scripts, fair-share policies, and accounting workflows. The challenge is getting Slurm scheduling capabilities onto Kubernetes—the standard platform for managing GPU infrastructure at scale—without maintaining two separate environments.

Slinky, an open source project developed by SchedMD (now part of NVIDIA), takes two approaches to this integration:

slurm-bridge brings Slurm scheduling to native Kubernetes workloads, allowing Slurm to act as a Kubernetes scheduler for pods.

slurm-operator runs full Slurm clusters on Kubernetes infrastructure, managing the complete lifecycle of Slurm daemons as pods.

This post focuses on the slurm-operator, which is how NVIDIA runs Slurm on Kubernetes for large-scale GPU training clusters. It walks through the architecture of the operator and how it maps Slurm daemons to Kubernetes primitives, then covers deployment, including how the Slinky slurm-operator integrates with your existing infrastructure. It also covers the Kubernetes ecosystem integrations that make this model practical. Finally, we share lessons from running Slinky in production at NVIDIA on clusters with more than 1,000 GPU worker nodes and more than 8,000 GPUs.