In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests.

‘Cold start’ means the full sequence a model server must complete before serving any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and registering with the service discovery layer. This delay increases the risk of SLA violations during traffic spikes, as the system cannot scale quickly enough to absorb sudden increases in demand.

The cold-start latency for a single-GPU vLLM (v0.20.0) workload breaks into three segments: container/image pull, engine initialization (weight loading, kernel warmup, graph compilation), and distributed runtime startup.

To address this, NVIDIA’s AI research team has introduced NVIDIA Dynamo Snapshot: a checkpoint/restore approach for AI inference workloads on Kubernetes.

https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/?linkId=100000423964029