Why Your LLM Doesn't Re-Read the Prompt: The KV-Cache

The KV-cache is the single most important optimisation in LLM inference — and the reason real-time chat with a model is even feasible. Here's what it is and why it matters.

Generation is autoregressive

An LLM produces text one token at a time: emit a token, append it to the sequence, and run the whole model again for the next. Inside each attention layer, every token becomes a Query, a Key, and a Value. To produce the newest token, its Query is scored against the Keys of every token so far (itself included), and the resulting weights blend their Values. So generating token t needs the K and V of tokens 1…t.
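To make the Query/Key/Value step concrete, here is a minimal single-head sketch in plain Python. The function name `attend_newest` is hypothetical, and the scaling by the square root of the dimension plus the softmax are the standard scaled dot-product choices, assumed here rather than taken from any particular model.

```python
import math

def attend_newest(q, keys, values):
    """Single-head attention for the newest token only (illustrative sketch).

    q      : Query vector of the newest token (length d)
    keys   : Key vectors for tokens 1..t, each of length d
    values : Value vectors for tokens 1..t
    """
    d = len(q)
    # Score the newest Query against every Key (scaled dot product, assumed).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Softmax turns the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Blend the Values using those weights.
    dim_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, out
```

Note that the function takes only one Query but needs every Key and Value: the newest token's Query is the only new input, while `keys` and `values` cover the whole prefix.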

The naïve approach is quadratic

Without a cache, each step re-encodes the entire prefix to rebuild K/V for tokens 1…t. Step 1 processes 1 token, step 2 processes 2, …, step N processes N. Total work ≈ 1+2+…+N = N(N+1)/2 token encodings, which grows quadratically in N. Token 1's K/V gets recomputed on every single step even though it never changes.
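The arithmetic above can be checked with a few lines of Python. This is only a counting sketch: the function names are hypothetical, and "work" is measured as the number of tokens re-encoded, not real FLOPs.

```python
def naive_token_encodings(n_steps):
    """Count tokens processed when every step re-encodes the full prefix."""
    total = 0
    for step in range(1, n_steps + 1):
        total += step  # step t re-encodes all t tokens of the prefix
    return total

def closed_form(n_steps):
    """The same count via the formula N(N+1)/2."""
    return n_steps * (n_steps + 1) // 2

for n in (10, 100, 1000):
    print(n, naive_token_encodings(n), closed_form(n))
# prints:
# 10 55 55
# 100 5050 5050
# 1000 500500 500500
```

Going from 100 to 1000 generated tokens multiplies the count by roughly 100, not 10, which is what quadratic growth looks like in practice.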
