A reader on my last post said Ollama was leaving a lot on the table — that a tuned backend with multi-token prediction (MTP) could roughly double my 3090's throughput. So I went and measured it, one lever at a time. The short version: they were right, the 2.25× is real, and below is the exact path that got me there on my box.
TL;DR
On a single RTX 3090, Qwen3.6-27B generation went from 35.7 tok/s (Ollama) to 80.2 tok/s (llama.cpp + MTP) — a measured 2.25× — by stacking three independent levers: a leaner engine, a smaller quant, and speculative decoding. The interesting part isn't the headline; it's which lever bought how much, and a couple of things that tripped me up on the way. (To be precise up front: MTP on its own is 1.78× at the same quant — the 2.25× is what you get when all three levers stack.)
The lever table
All on one RTX 3090, Qwen3.6-27B, 200 tokens generated, flash-attention on:








