One of the hottest topics in LLM inference acceleration right now is Speculative Decoding.

DSpark claims a 60%–85% single-user speedup at the same throughput. Google has published a stream of research on the technique: SpecTr, block verification, SpecRouter, and more.

Sounds great, right? A small model (the draft model) proposes a few tokens ahead, the large model verifies them in one batch, and generation gets faster.
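To make the draft-then-verify loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and verification uses simple greedy matching (keep draft tokens until the first one the large model disagrees with). Production systems that sample typically use a probabilistic acceptance rule instead, and they run the verification as a single batched forward pass rather than a Python loop.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2) The target model checks every proposed position.
    #    A real system scores all positions in one batched forward pass;
    #    the loop here only mirrors the logic.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            # First disagreement: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft token matched, so the target adds one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: Model, target: Model, k: int, max_new: int) -> List[int]:
    """Generate max_new tokens by repeating speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


def greedy(prefix: List[int], target: Model, max_new: int) -> List[int]:
    """Baseline: plain one-token-at-a-time decoding with the target model."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Toy models (assumptions): the target counts upward modulo 10,
# and the draft agrees except that it guesses wrong right after a 4.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, k=3, max_new=12))
    print(greedy([0], toy_target, max_new=12))
```

In this sketch, the speculative output matches plain greedy decoding with the large model token for token. The draft model only affects how many tokens each round yields, not which tokens come out.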

But if you're a production engineer looking at this, two questions immediately pop up:

"Block generation — doesn't that amplify hallucinations?"