My MTP post showed multi-token prediction roughly doubling Qwen3.6-27B's generation on a 3090. A reader asked the question I'd skipped: what about prompt processing at long context? So I measured it — and that turns out to be the real wall, the one MTP can't climb.

TL;DR

On a single RTX 3090, prefill (prompt processing) for Qwen3.6-27B drops from ~1,575 tok/s at 1k context to ~852 at 128k — so a 64k-token prompt takes ~59 seconds before the first token appears, and 128k takes ~2.5 minutes. MTP speeds the decode phase, not prefill, so on a long-context / short-answer request (the typical RAG shape) its 2× generation win shrinks to ~3% of total latency. MTP is real; it just stops mattering exactly where long-context RAG lives.

Prefill vs context size

llama-bench, Qwen3.6-27B IQ4_XS, prefill only (-n 0), flash-attention on, single RTX 3090: