DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

Autoregressive large language models generate text one token at a time. Each token waits for the one before it. This serial loop leaves modern GPUs underused and keeps inference slow. The cost grows worse with long Chain-of-Thought reasoning models. Their lengthy outputs make latency the dominant part of generation.

Speculative decoding is the standard fix. A small draft model proposes future tokens. The large target model verifies those tokens in parallel. Only tokens the target model agrees with are kept, so the output stays lossless. But most methods, including the state-of-the-art EAGLE-3, still draft autoregressively. That serial drafting caps real-world speedups near 2–3×.
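The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of EAGLE-3 or any real system: it uses greedy decoding only, and the toy models (`toy_target_next`, `toy_draft_next`) are hypothetical stand-ins. A real target model would score all drafted positions in one batched forward pass; the sketch loops over them for readability.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Draft k tokens one at a time, then keep only what the target agrees with."""
    drafted: List[Token] = []
    for _ in range(k):  # serial drafting: k sequential draft calls
        drafted.append(draft_next(prefix + drafted))

    accepted: List[Token] = []
    for tok in drafted:  # a real system verifies these positions in parallel
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch, drop the rest
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token when all match
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[Token]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + max_new]


def greedy_decode(prefix: List[Token], target_next: NextToken,
                  max_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding with the target model alone."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target_next(ctx: List[Token]) -> Token:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft_next(ctx: List[Token]) -> Token:
    guess = toy_target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess
```

Because every kept token is one the target itself would have produced, `generate` returns the same sequence as `greedy_decode` no matter how often the draft is wrong; a worse draft only lowers the number of tokens accepted per step.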

DFlash, introduced by a research team from UC San Diego (z-lab), takes a different route. It is a lightweight block diffusion model built for drafting. Instead of drafting tokens one at a time, it proposes a whole block in a single forward pass. The target model then verifies that block in parallel.
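To make the difference from serial drafting concrete, here is a hedged Python sketch of the control flow only. It does not model DFlash's diffusion drafter: `toy_draft_block` is a hypothetical function that returns a whole block from one call, and `toy_target_next` is a hypothetical greedy target. The point is the count of sequential draft calls, one per step instead of one per drafted token.

```python
from typing import Callable, List, Tuple

Token = int


def verify_block(prefix: List[Token], block: List[Token],
                 target_next: Callable[[List[Token]], Token]) -> List[Token]:
    """Keep the longest block prefix the target agrees with, plus one target token."""
    accepted: List[Token] = []
    for tok in block:  # a real target checks these positions in one parallel pass
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate_with_block_drafter(
    prefix: List[Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    target_next: Callable[[List[Token]], Token],
    block_size: int,
    max_new: int,
) -> Tuple[List[Token], int]:
    out = list(prefix)
    draft_calls = 0
    while len(out) - len(prefix) < max_new:
        block = draft_block(out, block_size)  # one call proposes the whole block
        draft_calls += 1
        out.extend(verify_block(out, block, target_next))
    return out[:len(prefix) + max_new], draft_calls


# Hypothetical toy models: the target counts upward modulo 10.
def toy_target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft_block(ctx: List[Token], n: int) -> List[Token]:
    block = [(ctx[-1] + i + 1) % 10 for i in range(n)]
    return [0 if tok == 7 else tok for tok in block]  # deliberate error on 7


tokens, calls = generate_with_block_drafter(
    [0], toy_draft_block, toy_target_next, block_size=4, max_new=12
)
print(tokens, calls)
```

In this toy run, 12 new tokens take three draft calls. A drafter that produced the same four-token proposals autoregressively would need four sequential calls per step, twelve in total, for the same three verification steps.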

The research team reports over 6× lossless acceleration across a range of models and tasks, and up to 2.5× higher speedup than EAGLE-3. On NVIDIA Blackwell, NVIDIA's engineering team reports up to 15× higher throughput for gpt-oss-120b at the same user interactivity target.
