Multi-Head Latent Attention (MLA)


Saturday, 23 May 2026


Compressing KV cache via low-rank projections — the attention mechanism behind DeepSeek-V2/V3 and Kimi K2.x

Why This Matters

Multi-Head Latent Attention (MLA) is the attention variant that replaces standard Multi-Head Attention (MHA) in DeepSeek-V2, DeepSeek-V3, and Kimi K2.x models. Instead of caching full key and value tensors for every head, MLA projects them into a shared low-dimensional latent space and caches only that latent representation, which compresses the KV cache by roughly 5-10x with minimal quality loss.
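The low-rank idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the DeepSeek or Kimi implementation: the sizes are toy values chosen for readability, the names `W_down`, `W_up_k`, and `W_up_v` are hypothetical, and the small rotary-position key that real MLA layers also cache is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, not taken from any real model config.
d_model, n_heads, d_head, d_latent = 64, 4, 16, 16
seq_len = 10

# One down-projection into the latent space, two up-projections back to K and V.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

h = rng.standard_normal((seq_len, d_model))  # token hidden states

# The only tensor kept in the cache: one latent vector per token.
latent_cache = h @ W_down  # shape (seq_len, d_latent)

# At attention time, per-head keys and values are rebuilt from the latents.
k = (latent_cache @ W_up_k).reshape(seq_len, n_heads, d_head)
v = (latent_cache @ W_up_v).reshape(seq_len, n_heads, d_head)

# Cached floats per token: full K and V for every head vs. one latent vector.
mha_floats_per_token = 2 * n_heads * d_head
mla_floats_per_token = d_latent
compression = mha_floats_per_token / mla_floats_per_token
print(f"cache compression: {compression:.1f}x")
```

With these toy sizes the cache shrinks from 128 floats per token to 16. The ratio in a real model depends on its head count, head dimension, and latent width.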

MLA also changes how prefix caching, chunked prefill, and paged attention must be implemented, because the cache holds compressed latent vectors rather than per-head keys and values.
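To make the paged-attention point concrete, here is a rough sketch of a block-based cache that stores one latent vector per token. The class `PagedLatentCache` and its methods are hypothetical and written only for illustration. They are not the API of any inference engine, and the rotary key that real MLA caches store alongside the latent is omitted.

```python
import numpy as np


class PagedLatentCache:
    """Toy paged cache holding one latent vector per token (hypothetical layout)."""

    def __init__(self, num_blocks, block_size, d_latent):
        self.block_size = block_size
        # A single pool shared by all sequences; note there is no per-head
        # axis and no separate K/V axis, unlike a standard paged KV cache.
        self.pool = np.zeros((num_blocks, block_size, d_latent), dtype=np.float32)
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}  # seq_id -> list of block indices
        self.lengths = {}      # seq_id -> number of cached tokens

    def append(self, seq_id, latent):
        table = self.block_table.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last one is full
            table.append(self.free_blocks.pop(0))
        self.pool[table[-1], n % self.block_size] = latent
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Return the sequence's latents as a (length, d_latent) array."""
        n = self.lengths[seq_id]
        blocks = self.pool[self.block_table[seq_id]]
        return blocks.reshape(-1, blocks.shape[-1])[:n]


# Usage: cache five tokens of one sequence with two tokens per block.
cache = PagedLatentCache(num_blocks=4, block_size=2, d_latent=3)
for t in range(5):
    cache.append("a", np.full(3, t, dtype=np.float32))
latents = cache.gather("a")
```

The block table works the same way as in a conventional paged cache. What changes is the shape of each block, and the attention kernel must expand the latents into keys and values (or fold the up-projections into its other matrices) when it reads them.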

Formal Definition

Multi-Head Latent Attention (MLA)
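As a hedged sketch of the core low-rank step, written in notation close to that of the DeepSeek-V2 paper: the symbols below do not appear in the text above and are introduced here as assumptions. Let $\mathbf{h}_t \in \mathbb{R}^{d}$ be the hidden state of token $t$, $n_h$ the number of heads, $d_h$ the per-head dimension, and $d_c$ the latent dimension.

```latex
\begin{aligned}
\mathbf{c}_t^{KV} &= W^{DKV}\,\mathbf{h}_t, & W^{DKV} &\in \mathbb{R}^{d_c \times d},\\
\mathbf{k}_t^{C}  &= W^{UK}\,\mathbf{c}_t^{KV}, & W^{UK} &\in \mathbb{R}^{n_h d_h \times d_c},\\
\mathbf{v}_t^{C}  &= W^{UV}\,\mathbf{c}_t^{KV}, & W^{UV} &\in \mathbb{R}^{n_h d_h \times d_c},
\end{aligned}
\qquad d_c \ll n_h d_h .
```

Only the latent $\mathbf{c}_t^{KV}$ has to be cached, since keys and values for all heads are recovered from it by the up-projections. This sketch leaves out the separate rotary-position key that full MLA also keeps.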

