Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%
Your chatbot serves Llama 70B on 8x H100s. Steady-state TTFT sits around 180 ms for short prompts, and the team is fine with that. Then you turn on a RAG feature: every request sends a 6,000-token context stuffed with retrieved documents, plus a short system prompt, plus the user's question. TTFT jumps to 1.4 seconds. p99 hits 2.1 s. A surprising share of those tokens are the same on every request: the system prompt, the tool definitions, the same 6k tokens of retrieved chunks for the top queries. The model is recomputing the same attention state over and over, then throwing it away.
This is the problem prefix caching solves, and last week's post on KV cache quantization closed with it as the next topic because the two features compose: a quantized prefix cache is cheaper to keep warm than a BF16 one, and the saved memory buys you either more concurrent users or a longer shared prefix.
Here's what prefix caching actually is, how vLLM and SGLang implement it differently, and where production deployments quietly lose most of the benefit.
Why this matters in practice
A modern LLM serving stack handles each request in two phases: prefill (process the entire prompt to build the KV cache) and decode (generate one token at a time, attending against the growing cache). For long-context workloads, prefill dominates time to first token. On a 70B Llama-3 with 8k tokens of input, prefill accounts for roughly 70–85% of TTFT; a single decode step is fast in comparison.
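To put rough numbers on that, here is a back-of-the-envelope sketch of how TTFT might respond to a prefix-cache hit. It rests on two simplifying assumptions: prefill time scales linearly with the number of uncached prompt tokens (which ignores the cost of new tokens attending over the cached prefix), and the non-prefill share of TTFT stays fixed. The function `estimate_ttft` and its parameters are illustrative names, not part of any serving framework. The 1.4 s figure comes from the RAG scenario above, and the 0.8 prefill share is simply a value picked from inside the 70–85% range.

```python
def estimate_ttft(ttft_s: float, prefill_share: float, cached_fraction: float) -> float:
    """Estimate TTFT (seconds) when part of the prompt is served from a prefix cache.

    Assumptions (illustrative only):
      * prefill time is proportional to the number of uncached prompt tokens
      * everything that is not prefill (queueing, tokenization, ...) is unchanged
    """
    if not 0.0 <= prefill_share <= 1.0:
        raise ValueError("prefill_share must be in [0, 1]")
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be in [0, 1]")

    prefill_s = ttft_s * prefill_share   # time spent building the KV cache
    other_s = ttft_s - prefill_s         # time a cache hit does not touch
    return other_s + prefill_s * (1.0 - cached_fraction)


if __name__ == "__main__":
    # Baseline: 1.4 s TTFT, with 80% of it assumed to be prefill.
    for cached in (0.0, 0.5, 0.9):
        print(f"{cached:.0%} of prompt cached -> ~{estimate_ttft(1.4, 0.8, cached):.2f} s")
```

Under these assumptions a 90% cached prompt brings the 1.4 s TTFT down to roughly 0.39 s, while a 50% hit only gets to about 0.84 s. The non-prefill remainder sets a floor that no amount of caching removes, so the achieved hit fraction matters more than whether caching is merely enabled.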








