MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent


Thursday, June 11, 2026


In my MTP post, speculative decoding roughly doubled Qwen3.6-27B generation speed on a 3090. It's tempting to read that as "turn on MTP, go faster." So I measured it on a different model, Gemma 4 12B QAT, and on my 3090 it's a big win again. But the same model with the same MTP draft runs slower on an M1 Max. MTP isn't a free switch; it's a hardware-dependent lever.
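One way to see why the same draft can help on one machine and hurt on another is a back-of-the-envelope cost model. The sketch below is not how I measured anything; it is a simplified model that assumes each drafted token is accepted independently with a fixed probability. The names accept_rate, draft_len, draft_cost and verify_cost are mine, and the numbers in the usage lines are made up for illustration, not benchmark results.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verify step.

    Assumes each drafted token is accepted independently with probability
    accept_rate (a simplification), and that the verify pass always yields
    at least one token.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def mtp_speedup(accept_rate: float, draft_len: int,
                draft_cost: float, verify_cost: float) -> float:
    """Estimated speedup over plain one-token-at-a-time decoding.

    draft_cost:  time to draft one token, in units of one plain decode step.
    verify_cost: time for one verify pass over the drafted tokens,
                 in the same units (1.0 means "as cheap as a plain step").
    """
    step_cost = draft_len * draft_cost + verify_cost
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


# Illustrative values only (NOT measurements): same acceptance rate and
# draft, but the verify pass is cheap on one machine and expensive on another.
cheap_verify = mtp_speedup(accept_rate=0.8, draft_len=2, draft_cost=0.05, verify_cost=1.1)
costly_verify = mtp_speedup(accept_rate=0.8, draft_len=2, draft_cost=0.05, verify_cost=2.6)
print(f"cheap verify:  {cheap_verify:.2f}x")
print(f"costly verify: {costly_verify:.2f}x")
```

In this model the acceptance rate is identical in both cases; only the relative cost of verifying several tokens at once changes, and that alone is enough to move the result from a speedup to a slowdown. Whether that is the actual cause on any given machine is something to measure, not assume.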

My 3090 numbers

Gemma 4 12B QAT (UD-Q4_K_XL) + an MTP draft head (Q8_0-MTP, a 0.47 GB nextn head, not a full second model), single RTX 3090, decode tok/s, 3 runs each:

config | mean tok/s
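For anyone reproducing this kind of table, here is a minimal sketch of how per-run decode speeds could be reduced to a mean and a speedup against the no-MTP baseline. The config names "baseline" and "mtp" and the numbers in the usage lines are placeholders I made up for the example; they are not my measurements.

```python
from statistics import mean


def summarize(runs_by_config: dict[str, list[float]],
              baseline: str) -> dict[str, tuple[float, float]]:
    """Map each config name to (mean tok/s, speedup vs. the baseline config)."""
    base_mean = mean(runs_by_config[baseline])
    return {
        name: (mean(runs), mean(runs) / base_mean)
        for name, runs in runs_by_config.items()
    }


# Placeholder numbers (NOT real results): three runs per config.
placeholder_runs = {
    "baseline": [10.0, 10.0, 10.0],
    "mtp": [15.0, 15.0, 15.0],
}
for name, (avg, speedup) in summarize(placeholder_runs, baseline="baseline").items():
    print(f"{name:10s} {avg:6.1f} tok/s  {speedup:.2f}x")
```

With only three runs per config, it is worth eyeballing the individual runs as well as the mean, since one slow run can shift the ratio noticeably.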

Related reading

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

The Prefill Wall: Why MTP's 2 Barely Moves Long-Context Latency (Qwen3.6-27B,…

Comparing Model Performance: Without MTP vs. With MTP vs. With MTP + QAT

Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever…

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas,…

BeeLlama v0.2.0: 164 tok/s on a 27B model, one RTX 3090
