BeeLlama v0.2.0: 164 tok/s on a 27B model, one RTX 3090

Speculative decoding has been the rumored 3-5x throughput multiplier for about 18 months. The numbers have stayed muddled because most of the public benchmarks ride on H100s with batch sizes greater than one, where the speedup gets folded into pricing tables nobody outside a serving team reads. What teams running a single workstation actually measure has been harder to find.

The BeeLlama v0.2.0 release pins down a specific point on that map. The setup is small enough to reproduce in a weekend: one RTX 3090, 32 GB of DDR4, a Ryzen 7 5700X3D, and llama.cpp build b9275 as the baseline. The two target models are Qwen 3.6 27B at Q5_K_S and Gemma 4 31B at the same quantization. The drafter for each is a Q4_K_M DFlash variant. The benchmark prompts and configs are pinned in the README and the GGUFs are on Hugging Face under Apache 2.0.

The Qwen row is the easier of the two to read. Baseline llama.cpp produces 37.2 tokens per second on a ~1K-token completion task. BeeLlama's DFlash path runs the same prompt at a 163.9 tok/s median, with a best run of 181.9. That is a 4.40x median multiplier on a card that costs around $700 used. The Gemma 4 31B row reports an even larger ratio: 36.1 tok/s baseline against a 177.8 tok/s median, a 4.93x multiplier on a model that is about 15% larger than the Qwen (31B versus 27B parameters). Two rows are a small sample, but the pattern (bigger model, slightly more speedup) is consistent with what speculative decoding theory predicts: per-token cost is dominated by the target model's verification step, and the drafter is much cheaper to run in either case, so its relative overhead shrinks as the target grows.
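The arithmetic behind those multipliers is easy to check, and the "theory" in question can be sketched with the classic idealized speculative decoding model. The snippet below is a minimal sketch, not BeeLlama's method: the throughput figures come from the paragraph above, while `alpha` (per-token acceptance rate), `gamma` (draft tokens per verification step), and `cost_ratio` (drafter cost relative to one target pass) are hypothetical values chosen for illustration. The release does not state them, and whether DFlash's drafter fits this autoregressive-draft cost model is an assumption.

```python
# Reported median throughput (tok/s), taken from the article text.
REPORTED = {
    "Qwen 3.6 27B": {"baseline": 37.2, "dflash_median": 163.9},
    "Gemma 4 31B": {"baseline": 36.1, "dflash_median": 177.8},
}


def speedup(baseline: float, accelerated: float) -> float:
    """Plain throughput ratio: accelerated tok/s over baseline tok/s."""
    return accelerated / baseline


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized speedup under the classic speculative decoding model.

    alpha      -- probability each draft token is accepted (assumed i.i.d.)
    gamma      -- draft tokens proposed per verification step
    cost_ratio -- cost of one drafter pass relative to one target pass
    """
    if alpha >= 1.0:
        tokens_per_step = gamma + 1.0
    else:
        # Expected tokens emitted per target verification pass.
        tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Each step costs gamma drafter passes plus one target pass.
    return tokens_per_step / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    for name, row in REPORTED.items():
        ratio = speedup(row["baseline"], row["dflash_median"])
        print(f"{name}: {ratio:.2f}x")

    # Hypothetical parameters, NOT measured values from the release.
    for cost_ratio in (0.05, 0.03):
        s = expected_speedup(alpha=0.9, gamma=8, cost_ratio=cost_ratio)
        print(f"cost_ratio={cost_ratio}: modeled speedup {s:.2f}x")
```

Recomputed from the rounded figures, the Qwen ratio comes out at about 4.41x, a hair off the quoted 4.40x, which is presumably a rounding difference. In the model, lowering `cost_ratio` while holding `alpha` and `gamma` fixed raises the predicted speedup, which is the direction the Qwen-to-Gemma comparison points if the same-sized drafter is proportionally cheaper against a larger target.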
