NVIDIA's Nemotron Diffusion: One Model, Three Generation Modes, 6× Faster

NVIDIA just released Nemotron-Labs Diffusion: a family of open-weight language models (3B, 8B, 14B, plus an 8B VLM) that can run in three distinct generation modes from the same checkpoint — autoregressive, diffusion, or self-speculative — with no application-level changes required. The headline number: 6.4× higher token throughput versus standard autoregressive decoding, with accuracy that matches or beats Qwen3 8B on benchmarks.

"Autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model."

What actually changed

Autoregressive LLMs have a hard constraint: one token at a time, every token a full model pass. That's fine for quality but brutal for throughput at low batch sizes, because each step is memory-bound: the GPU spends most of its time streaming model weights from memory rather than doing arithmetic.
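To put rough numbers on the memory-bound claim, here is a back-of-the-envelope roofline sketch. It assumes an 8B-parameter model stored in bf16 (2 bytes per weight) and about 3.35 TB/s of memory bandwidth. Both figures are illustrative assumptions, not numbers from NVIDIA's announcement, and the function name `max_tokens_per_second` is hypothetical. The sketch ignores KV-cache traffic and compute, so it gives only an upper bound.

```python
# Back-of-the-envelope roofline for batch-size-1 decoding.
# ASSUMPTIONS (illustrative only, not NVIDIA figures): 8B parameters in bf16
# (2 bytes each) and ~3.35 TB/s of memory bandwidth.


def max_tokens_per_second(num_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float,
                          tokens_per_pass: float = 1.0) -> float:
    """Hypothetical helper: upper bound on decode speed when every forward
    pass must stream all weights from memory once. Ignores KV-cache reads,
    activations, and compute time."""
    bytes_per_pass = num_params * bytes_per_param
    passes_per_second = bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_second * tokens_per_pass


if __name__ == "__main__":
    params, bf16_bytes, bandwidth = 8e9, 2.0, 3.35e12
    # Plain autoregressive decoding: one token committed per forward pass.
    ar = max_tokens_per_second(params, bf16_bytes, bandwidth)
    # Hypothetical parallel mode that commits 4 tokens per forward pass.
    parallel = max_tokens_per_second(params, bf16_bytes, bandwidth, 4.0)
    print(f"AR ceiling:           {ar:.0f} tok/s")
    print(f"4 tokens/pass ceiling: {parallel:.0f} tok/s")
```

Because each pass costs roughly one full read of the weights, a mode that commits several tokens per pass raises this ceiling roughly in proportion to the tokens kept per pass. In practice, gains depend on how many drafted tokens survive and on overheads this sketch ignores.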

Nemotron-Labs Diffusion breaks that constraint by adding parallel drafting on top of a pretrained AR model (rather than training a diffusion model from scratch). It offers three modes, switchable at deploy time: autoregressive, diffusion, and self-speculative.
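The announcement does not spell out how drafting and verification work internally. As a hedged illustration of the general idea behind self-speculative decoding, here is the standard greedy draft-then-verify loop from speculative decoding. The drafter and target are hypothetical toy functions, and `draft_block`, `verify_greedy`, and `generate` are illustrative names. In a self-speculative setup both roles would be played by the same checkpoint (for example, a parallel diffusion-style pass drafting and an autoregressive pass verifying), but that mapping is an assumption, not a description of NVIDIA's implementation.

```python
from typing import Callable, List

# HYPOTHETICAL stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def draft_block(draft_next: NextToken, prefix: List[int], k: int) -> List[int]:
    """Propose k tokens. A diffusion-style drafter would fill these slots in
    parallel; this sketch simply rolls a cheap drafter forward."""
    out: List[int] = []
    for _ in range(k):
        out.append(draft_next(prefix + out))
    return out


def verify_greedy(target_next: NextToken, prefix: List[int],
                  draft: List[int]) -> List[int]:
    """Keep the longest draft prefix the target agrees with, then add the
    target's own token (a correction on mismatch, a bonus on full accept).
    A real verifier scores all positions in ONE forward pass."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Draft k tokens, verify them, repeat; always gains >= 1 token per round."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        draft = draft_block(draft_next, seq, k)
        seq += verify_greedy(target_next, seq, draft)
    return seq[: len(prompt) + max_new]


if __name__ == "__main__":
    # Toy models: the target counts mod 10; the drafter is wrong after a 5.
    target = lambda s: (s[-1] + 1) % 10
    drafter = lambda s: 0 if s[-1] == 5 else (s[-1] + 1) % 10
    print(generate(target, drafter, [0], 12))
```

With greedy verification, the output is identical to what the target model would produce decoding one token at a time; the speedup comes from accepting several tokens per verification pass. For readability the sketch calls `target_next` once per position, whereas a real verifier scores all drafted positions in a single forward pass.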
