Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding | NVIDIA Technical Blog

As AI systems move from single-turn interactions to coordinated multiagent workflows, low-latency inference becomes increasingly important. Autoregressive LLMs generate tokens sequentially, which can limit GPU utilization and constrain throughput in latency-sensitive serving scenarios.

Speculative decoding helps mitigate this bottleneck by using a lightweight model to draft future tokens, which the larger target model then verifies in parallel. DFlash, an open source lightweight model designed for speculative decoding, extends this approach with a block-diffusion drafter. The drafter generates an entire block of candidate tokens in a single forward pass, turning sequential drafting into block-parallel GPU work while preserving the target model’s output quality through verification.
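The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the DFlash implementation: `toy_target_next` and `toy_draft_block` are hypothetical stand-ins for the target model and the block drafter, acceptance is greedy (exact match against the target's next token), and the per-position loop only emulates what a real target model would score in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
DraftFn = Callable[[List[Token], int], List[Token]]
TargetFn = Callable[[List[Token]], Token]


def toy_target_next(seq: List[Token]) -> Token:
    """Hypothetical target model: the next token is always last + 1 (mod 10)."""
    return (seq[-1] + 1) % 10


def toy_draft_block(seq: List[Token], block_size: int) -> List[Token]:
    """Hypothetical block drafter: returns a whole block at once and
    deliberately gets the third position of every block wrong."""
    out: List[Token] = []
    last = seq[-1]
    for i in range(block_size):
        last = (last + (2 if i == 2 else 1)) % 10
        out.append(last)
    return out


def speculative_step(prefix: List[Token], draft_block: DraftFn,
                     target_next: TargetFn, block_size: int) -> List[Token]:
    """One draft-and-verify step with greedy acceptance."""
    draft = draft_block(prefix, block_size)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[Token], n_new: int, draft_block: DraftFn,
             target_next: TargetFn, block_size: int) -> Tuple[List[Token], int]:
    """Repeat draft-and-verify steps until n_new tokens have been produced."""
    seq = list(prefix)
    steps = 0
    while len(seq) - len(prefix) < n_new:
        seq.extend(speculative_step(seq, draft_block, target_next, block_size))
        steps += 1
    return seq[: len(prefix) + n_new], steps


if __name__ == "__main__":
    tokens, steps = generate([0], 6, toy_draft_block, toy_target_next, block_size=4)
    print(tokens, steps)
```

In this toy setup, every step emits at least one token chosen by the target, so the output matches what the target alone would generate greedily, while the number of verification steps falls as more drafted tokens are accepted. Production systems that sample rather than decode greedily typically use a probabilistic acceptance rule instead of exact matching.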

DFlash increases inference performance for gpt-oss-120b on NVIDIA Blackwell by up to 15x at the same interactivity level. It nearly doubles interactivity for Llama 3.1 8B at the same concurrency compared with state-of-the-art EAGLE-3 speculative decoding.

DFlash is also moving quickly from research into developer workflows. The research team has released 20 DFlash checkpoints on Hugging Face with recipes for NVIDIA Blackwell and NVIDIA Hopper GPUs.
