NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B

NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes in one architecture. The model supports autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. It is available in 3B, 8B, and 14B parameter sizes. The family includes base, instruct, and vision-language variants.

Sequential Decoding Limits Throughput

Standard AR language models generate text one token at a time, left to right. Each token depends on all previous tokens, and this sequential dependency limits how much GPU parallelism a single generation step can use. The result is low hardware utilization at low batch sizes, the typical setting for single-user or edge deployment.

Diffusion language models (LMs) offer a different approach. Instead of generating tokens sequentially, they denoise multiple tokens in parallel per forward pass, which enables higher throughput. The tradeoff has been accuracy: diffusion LMs have consistently lagged behind AR models on benchmarks and have required substantially more data to reach comparable performance. A key reason is that diffusion training treats all token permutations uniformly, rather than leveraging the strong left-to-right prior inherent in natural language.
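To make the contrast concrete, here is a minimal sketch that counts forward passes under each decoding style. It is a toy illustration only, not NVIDIA's algorithm: `toy_model`, `ar_decode`, `parallel_decode`, and the confidence-based commit rule are all hypothetical names and assumptions introduced for this example, and the "model" simply returns placeholder tokens with random confidence scores.

```python
import random

MASK = None  # placeholder for a position that has not been decoded yet


def toy_model(seq):
    """Hypothetical stand-in for one forward pass.

    Returns a (token, confidence) guess for every position in `seq`.
    A real model would compute these from its weights.
    """
    return [(f"tok{i}", random.random()) for i in range(len(seq))]


def ar_decode(length):
    """Autoregressive decoding: one new token committed per forward pass."""
    seq, passes = [], 0
    while len(seq) < length:
        preds = toy_model(seq + [MASK])  # predict only the next position
        passes += 1
        seq.append(preds[-1][0])
    return seq, passes


def parallel_decode(length, tokens_per_pass):
    """Diffusion-style decoding (assumed schedule): start fully masked and
    commit the `tokens_per_pass` most confident masked positions per pass."""
    seq, passes = [MASK] * length, 0
    while MASK in seq:
        preds = toy_model(seq)
        passes += 1
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:tokens_per_pass]:
            seq[i] = preds[i][0]
    return seq, passes


if __name__ == "__main__":
    _, ar_passes = ar_decode(12)
    _, par_passes = parallel_decode(12, tokens_per_pass=6)
    print(f"AR passes: {ar_passes}, parallel passes: {par_passes}")
```

In this toy setup, committing six tokens per pass cuts the number of forward passes for a 12-token output from 12 to 2. The value six is borrowed from the headline figure purely for illustration; real speedups depend on how many tokens the model can commit per pass without hurting accuracy.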
