I Benchmarked Speculative Decoding — a = 3.5 Wasn't Enough


Sunday, 28 June 2026

TL;DR

Speculative Decoding on CPU is 49-62% slower than standard autoregressive decoding; 15-30% of draft rounds are rejected entirely, cancelling out the theoretical gains. The technique is GPU-bound: on GPU α<0.15, on CPU α≈0.3, which makes the draft model uncompetitive in CPU-constrained environments.


In my last post, I laid out the core inequality of Speculative Decoding:

a > 1 + α + β

Acceptance length a must exceed 1 plus the draft/target compute ratio α plus verification overhead β. If it does, SD wins. If it doesn't, SD loses.
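As a quick sanity check, the inequality can be turned into a few lines of Python. This is only a sketch of the arithmetic, not the benchmark code: the function names `sd_margin` and `sd_wins` are made up for illustration, and the value of β in the example is an assumed placeholder, since the inequality itself does not say how to measure it.

```python
def sd_margin(a: float, alpha: float, beta: float) -> float:
    """Return a - (1 + alpha + beta).

    a     : acceptance length (tokens accepted per round)
    alpha : draft/target compute ratio
    beta  : verification overhead
    A positive result means the inequality a > 1 + alpha + beta holds.
    """
    return a - (1.0 + alpha + beta)


def sd_wins(a: float, alpha: float, beta: float) -> bool:
    """True when the inequality predicts that SD beats plain decoding."""
    return sd_margin(a, alpha, beta) > 0.0


if __name__ == "__main__":
    # a and alpha are taken from the post's title and TL;DR;
    # beta = 0.5 is a hypothetical value chosen only for illustration.
    a, alpha, beta = 3.5, 0.3, 0.5
    print(f"margin = {sd_margin(a, alpha, beta):.2f}")
    print(f"inequality holds: {sd_wins(a, alpha, beta)}")
```

A positive margin only says the inequality is satisfied on paper; it says nothing about effects the three terms do not capture.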

That was theory. This post is the practice.

I ran a real A/B test on my machine. The results were worse than I expected — and more interesting.
