The Prefill Wall: Why MTP's 2 Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

My MTP post showed multi-token prediction roughly doubling Qwen3.6-27B's generation on a 3090. A reader asked the question I'd skipped: what about prompt processing at long context? So I measured it — and that turns out to be the real wall, the one MTP can't climb.

TL;DR

On a single RTX 3090, prefill (prompt processing) for Qwen3.6-27B drops from ~1,575 tok/s at 1k context to ~852 at 128k — so a 64k-token prompt takes ~59 seconds before the first token appears, and 128k takes ~2.5 minutes. MTP speeds the decode phase, not prefill, so on a long-context / short-answer request (the typical RAG shape) its 2× generation win shrinks to ~3% of total latency. MTP is real; it just stops mattering exactly where long-context RAG lives.

Prefill vs context size

llama-bench, Qwen3.6-27B IQ4_XS, prefill only (-n 0), flash-attention on, single RTX 3090:

TL;DR

Prefill vs context size

llama-bench, Qwen3.6-27B IQ4_XS, prefill only (-n 0), flash-attention on, single RTX 3090:

The Prefill Wall: Why MTP's 2 Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

The Prefill Wall: Why MTP's 2 Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

Other newsrooms on this story

Related reading

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever…

MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is…

Qwen 3.6 35B-A3B for Local AI in 2026: The 24GB VRAM Line That Gets You 120…

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster…

Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching

Other newsrooms on this story

Related reading

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever…

MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is…

Qwen 3.6 35B-A3B for Local AI in 2026: The 24GB VRAM Line That Gets You 120…

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster…

Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching