KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops working. With BF16, the KV cache alone for a 70B Llama-3 at 32k context is roughly 2 × 80 layers × 8 KV heads × 32768 tokens × 128 head_dim × 2 bytes ≈ 10.7 GB per request. Two hundred of those, and the H100s are paging to CPU. The model itself fits; the attention state doesn't. This is the problem KV cache quantization is built for, and it's the natural follow-up to last week's piece on speculative decoding — because the two features interact in ways that don't always show up in vendor benchmarks.

Here's how it works, what the formats are, and where the footguns hide.

Why this matters in practice

The KV cache is the largest dynamic piece of memory in a serving LLM. The model weights are fixed at load time. The activations get freed after each forward pass. The KV cache grows with batch_size × seq_len and stays allocated until the request ends. On a long-context workload, it dominates.