You have probably noticed that ChatGPT or Claude streams words to your screen almost instantly. But behind the scenes, generating each word requires a massive model to perform billions of computations. So how do these systems feel so fast?
One of the key answers is a technique called speculative decoding, an inference optimization that makes large language models generate text significantly faster without changing what they produce: in its standard form, the output follows exactly the same distribution as the original model's.
First: Why is text generation slow?
To understand speculative decoding, you need to understand one fundamental constraint of large language models.
They generate text one token at a time.
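To make that constraint concrete, here is a minimal sketch of the one-token-at-a-time loop. The function toy_next_token is a hypothetical stand-in for a real model's forward pass (it is not any library's API), and the token IDs are made-up integers. The point is only the shape of the loop: each new token needs its own call, and that call depends on every token produced so far.

```python
from typing import Callable, List


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for one full forward pass of a large model.

    A real model would run billions of operations here; this toy version
    just derives a number from the tokens seen so far.
    """
    return (sum(tokens) + len(tokens)) % 50


def generate(
    prompt: List[int],
    max_new_tokens: int,
    next_token: Callable[[List[int]], int] = toy_next_token,
) -> List[int]:
    """Autoregressive generation: one model call per new token."""
    tokens = list(prompt)  # copy so the caller's prompt is not modified
    for _ in range(max_new_tokens):
        # Step t cannot start until step t-1 has finished, because the
        # model's input includes the token that was just generated.
        tokens.append(next_token(tokens))
    return tokens


if __name__ == "__main__":
    print(generate([1, 2, 3], max_new_tokens=5))
```

Generating five tokens here costs five sequential calls, and none of them can run in parallel with the others. That sequential dependency is the bottleneck speculative decoding is designed to work around.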