The cold-start problem
In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. However, cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests.
This delay raises the risk of service-level agreement (SLA) violations during traffic spikes, because new replicas cannot come online quickly enough to absorb a sudden increase in demand.
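To make the cost concrete, here is a minimal back-of-the-envelope sketch of what accumulates while a new replica is still starting. The function names and every number are hypothetical, chosen only for illustration. They are not measurements from this article, and the model assumes a constant arrival rate and a fixed cold-start time.

```python
def backlog_during_cold_start(arrival_rps: float, capacity_rps: float,
                              cold_start_s: float) -> float:
    """Requests that queue up while the new replica is not yet serving.

    Assumes a constant arrival rate and that existing replicas keep
    serving at capacity_rps for the whole cold-start window.
    """
    excess_rps = max(0.0, arrival_rps - capacity_rps)
    return excess_rps * cold_start_s


def idle_gpu_seconds(new_replicas: int, gpus_per_replica: int,
                     cold_start_s: float) -> float:
    """GPU time that is allocated but produces no tokens during startup."""
    return new_replicas * gpus_per_replica * cold_start_s


# Hypothetical spike: traffic rises from 10 to 15 requests/s, existing
# replicas can serve 10 requests/s, and one single-GPU replica needs
# 180 s (3 minutes) to become ready.
backlog = backlog_during_cold_start(arrival_rps=15.0, capacity_rps=10.0,
                                    cold_start_s=180.0)
idle = idle_gpu_seconds(new_replicas=1, gpus_per_replica=1,
                        cold_start_s=180.0)

print(f"queued requests at readiness: {backlog:.0f}")
print(f"idle GPU-seconds during startup: {idle:.0f}")
```

Under these assumed inputs, both quantities grow linearly with cold-start time, which is why shortening startup directly reduces both the queue a new replica inherits and the GPU time paid for without output.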
For a single-GPU vLLM (v0.20.0) workload, Figure 1 shows how the cold-start latency breaks down:
Figure 1. Cold-Start Latency Breakdown for a Single-GPU Inference Worker