NVIDIA just released Nemotron-Labs Diffusion: a family of open-weight language models (3B, 8B, 14B, plus an 8B VLM) that can run in three distinct generation modes from the same checkpoint — autoregressive, diffusion, or self-speculative — with no application-level changes required. The headline number: 6.4× higher token throughput versus standard autoregressive decoding, with accuracy that matches or beats Qwen3 8B on benchmarks.
"Autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model."
What actually changed
Autoregressive LLMs have a hard constraint: one token at a time, every token a full model pass. That's fine for quality but brutal for throughput at low batch sizes — the GPU spends most of its time on memory ops, not compute.
Nemotron-Labs Diffusion breaks that constraint by adding parallel drafting on top of a pretrained AR model (rather than training a diffusion model from scratch). Three modes, switchable at deploy time:









