Introduction

Speculative decoding is one of those techniques that has been "almost ready for production" for the better part of three years. A small draft model proposes several tokens; a larger target model verifies them all in a single forward pass and keeps the ones it agrees with. In theory, you get 2–4× the throughput of decoding with the target model alone. In practice, the draft model has to be cheap, fast, and good enough at mimicking the target's distribution, which is a much harder combination than it sounds.
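To make the draft-then-verify loop concrete, here is a minimal sketch of the generic greedy variant of speculative decoding. It is not DSpark's method, and everything in it is an assumption made for illustration: the "models" are toy functions over integer tokens, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical. A real system would score all drafted positions in one batched forward pass of the target; the sketch simulates that with a plain loop.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal in order. (A real implementation gets
    #    all of these predictions from a single forward pass.)
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First disagreement: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target's next token comes for free.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft: Model, target: Model,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def toy_target(prefix: List[int]) -> int:
    # Toy target: counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    # Toy draft: agrees with the target except right after a 2.
    return 0 if prefix[-1] == 2 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft, toy_target, k=4, max_new=12)
    print(tokens, rounds)
```

Because a mismatch is always replaced by the target's own token, the output is identical to what plain greedy decoding with the target would produce; the only thing the draft changes is how many target rounds are needed. That is also why draft quality matters so much: every early disagreement throws away the rest of that round's proposals.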

Yesterday, a new paper from DeepSeek quietly climbed to the top of Hacker News (714+ points, 290+ comments at the time of writing). It's called DSpark, and it reframes speculative decoding in a way that looks like it could finally make the technique drop-in rather than bolt-on.

The paper is here: github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

The Core Idea