KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops working. With BF16, the KV cache alone for a 70B Llama-3 at 32k context is roughly 2 (K and V) × 80 layers × 8 KV heads × 32768 tokens × 128 head_dim × 2 bytes ≈ 10.7 GB per request. Two hundred of those is about 2.1 TB, and the H100s are paging to CPU. The model itself fits; the attention state doesn't. This is the problem KV cache quantization is built for, and it's the natural follow-up to last week's piece on speculative decoding, because the two features interact in ways that don't always show up in vendor benchmarks.
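To make that arithmetic easy to rerun for other context lengths, here is a minimal sketch. The function name `kv_cache_bytes` and its parameter names are made up for illustration; the shape values are the ones quoted above, not read from any model config.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of K and V state held for one request (hypothetical helper)."""
    # The leading 2 covers the separate K and V tensors stored in every layer.
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_elem


# Shape values quoted in the text for a 70B Llama-3 at 32k context, BF16 (2 bytes).
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32768)

print(f"{per_request / 1e9:.1f} GB per request")            # 10.7 GB per request
print(f"{200 * per_request / 1e12:.2f} TB for 200 requests")  # 2.15 TB for 200 requests
```

Swapping `seq_len=8192` into the same call gives the 8k figure, which is a quarter of the 32k one because the cache is linear in sequence length.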

Here's how it works, what the formats are, and where the footguns hide.

Why this matters in practice

The KV cache is the largest dynamic piece of memory in a serving LLM. The model weights are fixed at load time. The activations get freed after each forward pass. The KV cache grows with batch_size × seq_len and stays allocated until the request ends. On a long-context workload, it dominates.
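A rough way to see why element width matters is to ask how many requests fit in a fixed KV budget. The sketch below is illustrative only: the 500 GB budget is a made-up number, the helper names are hypothetical, and treating FP8/INT8 as exactly 1 byte per element ignores any scale-factor metadata a real quantized cache stores alongside K and V.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV bytes added per token of context (hypothetical helper)."""
    # K and V each hold num_kv_heads * head_dim elements in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent(budget_bytes, seq_len, bytes_per_token):
    """Whole requests of length seq_len that fit in budget_bytes."""
    return budget_bytes // (seq_len * bytes_per_token)


KV_BUDGET = 500 * 10**9  # assumption: memory left for KV after weights, in bytes
SEQ_LEN = 32768

# 2 bytes per element for BF16; 1 byte assumed for FP8 or INT8 (scales ignored).
for name, width in [("BF16", 2), ("FP8/INT8", 1)]:
    per_token = kv_bytes_per_token(80, 8, 128, width)
    fits = max_concurrent(KV_BUDGET, SEQ_LEN, per_token)
    print(f"{name}: {per_token} bytes/token, {fits} concurrent 32k requests")
```

Under these assumptions, halving the element width roughly doubles the number of requests that fit, because the cache size is linear in bytes per element just as it is in batch_size and seq_len.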
