Doubling Qwen3.6-27B on One RTX 3090: Ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)

A reader on my last post said Ollama was leaving a lot on the table — that a tuned backend with multi-token prediction (MTP) could roughly double my 3090's throughput. So I went and measured it, one lever at a time. The short version: they were right, the 2.25× is real, and below is the exact path that got me there on my box.

TL;DR

On a single RTX 3090, Qwen3.6-27B generation went from 35.7 tok/s (Ollama) to 80.2 tok/s (llama.cpp + MTP) — a measured 2.25× — by stacking three independent levers: a leaner engine, a smaller quant, and speculative decoding. The interesting part isn't the headline; it's which lever bought how much, and a couple of things that tripped me up on the way. (To be precise up front: MTP on its own is 1.78× at the same quant — the 2.25× is what you get when all three levers stack.)
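As a quick sanity check on those ratios, here is a minimal sketch of the arithmetic. It uses only the three numbers quoted above. The variable names are mine, and the "engine + quant" figure is derived rather than measured: it assumes the levers compound multiplicatively, which is a simplification.

```python
# Figures quoted in the TL;DR above.
OLLAMA_TOK_S = 35.7        # baseline: Ollama
LLAMACPP_MTP_TOK_S = 80.2  # final: llama.cpp + MTP, all three levers stacked
MTP_ONLY_SPEEDUP = 1.78    # MTP alone, at the same quant


def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Throughput ratio between two tokens-per-second measurements."""
    return after_tok_s / before_tok_s


total = speedup(OLLAMA_TOK_S, LLAMACPP_MTP_TOK_S)

# ASSUMPTION: the levers multiply. Under that assumption, whatever is left
# after dividing out MTP is the combined share of the engine and quant levers.
# This is an implied number, not a separate measurement.
engine_plus_quant = total / MTP_ONLY_SPEEDUP

print(f"total speedup:            {total:.2f}x")              # 2.25x
print(f"MTP alone:                {MTP_ONLY_SPEEDUP:.2f}x")   # 1.78x
print(f"engine + quant (implied): {engine_plus_quant:.2f}x")  # 1.26x
```

Read this way, MTP is the biggest single lever, and the engine swap plus the smaller quant together account for roughly another quarter on top of it.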

The lever table

All on one RTX 3090, Qwen3.6-27B, 200 tokens generated, flash-attention on:
