Speculative decoding: how it works & when to use it

You're running LLM inference in production. Semantic caching handles the easy wins: repeated queries with the same intent come back from cache without touching the model. But everything else still hits the model at full cost, and that adds up fast at scale.Speculative decoding is one way to speed up those requests without touching the model or its outputs. This guide covers how it works, the variants gaining traction in 2026, when it helps, when it doesn't, and where it fits in a layered inference stack alongside semantic caching.What is speculative decodingSpeculative decoding makes LLM responses faster without changing what the model outputs. Instead of generating one token at a time, a small draft model runs ahead and proposes several tokens. The large model checks those proposals in a single pass. When they're accepted, you get multiple tokens for roughly the cost of one. The output stays identical to running the large model alone, so there's no quality tradeoff.How much faster depends on how often the large model agrees with the draft. When the two align closely, the gains are meaningful. When they diverge, the overhead of running the draft model eats into the benefit. Early experiments on large models measured 2–2.5× speedups in distributed setups, but results vary widely depending on model pairing and workload.The memory-bandwidth problem with autoregressive decodingLLMs generate text slowly not because the math is hard, but because of how much data has to move. Every token your model generates requires loading the entire set of model weights from memory (tens of gigabytes for a large model), doing a small amount of arithmetic on them, and then producing a single token. Then repeating the whole thing for the next token.The GPU's arithmetic units finish their work fast. The bottleneck is the memory transfer: fetching all those weights takes far longer than the actual computation. So the GPU sits partially idle between tokens, waiting on memory reads rather than doing useful work.That idle compute is the gap speculative decoding exploits. It's also why semantic caching complements it: for repeated or semantically similar queries, no token generation runs at all. The answer is returned from cache. Speculative decoding speeds up the requests that do make it to the model.Make your AI apps faster and cheaperCut costs by up to 90% and lower latency with semantic caching powered by Redis.How the draft-verify loop worksThe core idea: run a cheap model ahead of the expensive one, then check its work in bulk. Here's how that plays out in practice.A small draft model generates several candidate tokens from the current context. This is fast and inexpensive compared to running the large model. The large model then checks all those candidates in a single pass. As covered above, that check costs roughly the same as generating one token the normal way.Each candidate either passes or fails the check. The first failure stops the loop, and the large model corrects from that point. If all candidates pass, the large model adds one more token on top. The result: between one and several tokens produced per large-model pass, depending on how well the draft and target models agree.The output is provably identical to what the large model would have generated on its own. Speculative decoding is a latency optimization, not an approximation.Variants gaining traction in 2026The base algorithm has spawned a range of variants, each solving a different limitation: draft model overhead, memory footprint, workload structure, or hardware constraints. Here are the ones showing up in recent research and production frameworks.EAGLE-3: higher acceptance ratesEAGLE-3 is a speculative decoding method that improves how well the draft model predicts what the large model will output. Better predictions mean more draft tokens get accepted, and more accepted tokens means more output per large-model pass. The EAGLE family achieves this by attaching a lightweight head directly to the target model rather than training a separate model from scratch. EAGLE-3 advances on earlier versions by predicting tokens directly (rather than intermediate model features) and drawing on representations from multiple layers of the target model rather than just the top one. Across tested configurations, EAGLE-3 reported 3.0–6.5× speedups over standard generation, with a 20–40% improvement over EAGLE-2. A newer extension, P-EAGLE, pushes further by generating all draft tokens in a single pass, reporting 1.10–1.36× additional gains over EAGLE-3 in large-scale model benchmarks.SuffixDecoding: training-free for agentic workloadsSuffixDecoding is useful when your workload produces repetitive, structured outputs — multi-step SQL generation, tool-calling loops, or code pipelines where the same patterns appear across requests. Instead of training a draft model, it builds draft candidates by matching the current generation against a history of past outputs. It maintains two pools of that history: one for the current request, and a shared pool across all prior requests. The shared pool drives most of the speedup, because cross-request repetition is where the gains are. No model training required means it can be layered onto an existing inference stack. SuffixDecoding reported a mean 5.3× speedup on AgenticSQL benchmarks.Give your AI apps real-time contextRun them on Redis for AI, built for fast retrieval and low-latency responses.

Speculative decoding: how it works & when to use it

Speculative decoding: how it works & when to use it

Other newsrooms on this story

Related reading

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference…

Speculative decoding: when and why it actually speeds up inference

Speculative decoding for high-throughput long-context inference

Lossless, But Not Free: The Lossless, But Not Free — When Speculative Decoding…

DeepSeek's DSpark Brings Speculative Decoding Back Into the Spotlight — Here's…

Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding

Other newsrooms on this story

Related reading

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference…

Speculative decoding: when and why it actually speeds up inference

Speculative decoding for high-throughput long-context inference

Lossless, But Not Free: The Lossless, But Not Free — When Speculative Decoding…

DeepSeek's DSpark Brings Speculative Decoding Back Into the Spotlight — Here's…

Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding