NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes

In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests.

‘Cold start’ means the full sequence a model server must complete before serving any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and registering with the service discovery layer. This delay increases the risk of SLA violations during traffic spikes, as the system cannot scale quickly enough to absorb sudden increases in demand.

The cold-start latency for a single-GPU vLLM (v0.20.0) workload breaks into three segments: container/image pull, engine initialization (weight loading, kernel warmup, graph compilation), and distributed runtime startup.

To address this, NVIDIA’s AI research team has introduced NVIDIA Dynamo Snapshot: a checkpoint/restore approach for AI inference workloads on Kubernetes.

https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/?linkId=100000423964029

To address this, NVIDIA’s AI research team has introduced NVIDIA Dynamo Snapshot: a checkpoint/restore approach for AI inference workloads on Kubernetes.

https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/?linkId=100000423964029

NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes

NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes

Other newsrooms on this story

Related reading

NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes |…

A Guide to AI Cold Starts on Cloud Run

Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters | NVIDIA…

NVIDIA Technical Blog

NVIDIA DSX Air Boosts Time to Token With Accelerated Simulation for AI Factories

GPU autoscaling on Kubernetes with KEDA: building an external scaler with NVML

Other newsrooms on this story

Related reading

NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes |…

A Guide to AI Cold Starts on Cloud Run

Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters | NVIDIA…

NVIDIA Technical Blog

NVIDIA DSX Air Boosts Time to Token With Accelerated Simulation for AI Factories

GPU autoscaling on Kubernetes with KEDA: building an external scaler with NVML