Speculative decoding has been the rumored 3-5x throughput multiplier for about 18 months. The numbers have stayed muddled because most of the public benchmarks ride on H100s with batch sizes greater than one, where the speedup gets folded into pricing tables nobody outside a serving team reads. What teams running a single workstation actually measure has been harder to find.
The BeeLlama v0.2.0 release pins down a specific point on that map. The setup is small enough to reproduce in a weekend: one RTX 3090, 32 GB of DDR4, a Ryzen 7 5700X3D, and llama.cpp build b9275 as the baseline. The two target models are Qwen 3.6 27B at Q5_K_S and Gemma 4 31B at the same quantization. The drafter for each is a Q4_K_M DFlash variant. The benchmark prompts and configs are pinned in the README and the GGUFs are on Hugging Face under Apache 2.0.
The Qwen row is the easier of the two to read. Baseline llama.cpp turns out 37.2 tokens per second on a ~1K-token completion task. BeeLlama's DFlash path runs the same prompt at a 163.9 tok/s median, with a best run of 181.9. That is a 4.40x median multiplier on a card that costs around $700 used. The Gemma 4 31B row reports an even larger ratio: 36.1 tok/s baseline against 177.8 tok/s median, a 4.93x multiplier on a model that is 15% larger than the Qwen. The pattern — bigger model, slightly more speedup — is consistent with what speculative decoding theory predicts, because the per-token cost is dominated by the target model's verification step and the drafter is much cheaper to run in either case.













