Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

I tested speculative decoding with Multi-Token Prediction (MTP) in Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB VRAM, comparing it against standard decoding.

For a broader view of token speeds and VRAM trade-offs across more models on the same hardware, see 16 GB VRAM LLM benchmarks with llama.cpp.

What MTP (Multi-Token Prediction) Is

Multi-Token Prediction is a form of speculative decoding built directly into certain model checkpoints. Instead of predicting one token per forward pass, the model carries extra "MTP heads" that propose several future tokens in a single step, and the main model then verifies those proposals in parallel. Every accepted guess raises effective throughput without changing output quality; a rejected guess is discarded and replaced by the main model's own token.
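The draft-then-verify loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not llama.cpp's or Qwen's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main model and the MTP heads, acceptance is plain greedy matching, and verification is written as a loop where a real runtime would score all drafted positions in one batched forward pass.

```python
from typing import List, Tuple

VOCAB = 11  # toy vocabulary size (assumption for this sketch)


def target_next(ctx: List[int]) -> int:
    """Hypothetical main model: a deterministic greedy next-token rule."""
    return (ctx[-1] * 3 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    """Hypothetical MTP head: matches the main model except when the
    context length is a multiple of 4, where it guesses wrong."""
    guess = target_next(ctx)
    return (guess + 1) % VOCAB if len(ctx) % 4 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Standard decoding: one token per main-model step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Draft k tokens, keep the longest prefix the main model agrees with.
    Returns the decoded sequence and the number of verification steps."""
    out = list(prompt)
    goal = len(prompt) + n_new
    steps = 0
    while len(out) < goal:
        steps += 1
        # Draft phase: the cheap head proposes k future tokens.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Verify phase: compare each proposal with the main model's choice.
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # main model's token replaces the miss
                break
        else:
            # Every draft was accepted, so the same pass yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal], steps


prompt = [1, 2]
baseline = greedy_decode(prompt, 20)
spec_out, spec_steps = speculative_decode(prompt, 20, k=3)
print("same output:", spec_out == baseline)
print("main-model steps:", spec_steps, "vs", 20)
```

In this sketch the speculative path emits exactly the tokens that standard greedy decoding would, because every token that reaches the output is one the main model would have chosen for that context. The gain comes only from needing fewer main-model steps, and it shrinks as the draft head misses more often.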

The Qwen 3.6 family ships both standard GGUF files and MTP-enabled variants. In llama.cpp, MTP is activated through:
