One of the hottest topics in LLM inference acceleration right now is Speculative Decoding.
DSpark claims a 60%–85% single-user speedup at the same throughput. Google has published a stream of research on it: SpecTr, block verification, SpecRouter, and more.
Sounds great, right? A small draft model proposes a few tokens ahead, the large model verifies them all in one batched pass, and speed goes up.
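To make the draft-then-verify loop concrete, here is a minimal sketch of the greedy variant, where a draft token is kept only if it matches what the large model would have picked itself. Everything in it is an assumption made for illustration: `toy_target`, `toy_draft`, `speculative_step`, and the lookahead `k` are hypothetical names, the "models" are plain functions rather than real networks, and production systems typically use a sampling-based acceptance rule instead of exact matching.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" is any function from a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Pretend large model: deterministic next token in the range 0..49."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Pretend small model: agrees with the target except when len(prefix) % 3 == 0."""
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the tokens that were accepted."""
    # 1. The draft model proposes k tokens one by one (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks every proposed position. A real system scores
    #    all k positions in a single batched forward pass; this loop is for clarity.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the same pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken, n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after prefix using repeated speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_tokens]


def plain_greedy(prefix: List[int], target: NextToken, n_tokens: int) -> List[int]:
    """Baseline: the target model decoding one token at a time."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target(out))
    return out
```

Each round returns between 1 and `k + 1` tokens, which is where the speedup comes from when the draft guesses well. In this greedy toy setting, `generate([1, 2, 3], toy_draft, toy_target, 20)` produces the same sequence as `plain_greedy([1, 2, 3], toy_target, 20)`, because every kept token is one the large model would have chosen anyway.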
But if you're a production engineer looking at this, two questions immediately pop up:
"Block generation — doesn't that amplify hallucinations?"












