Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation | NVIDIA Technical Blog

Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This limits responsiveness, increases serving costs, and makes fluid, interactive experiences difficult to achieve.

DiffusionGemma, created by Google DeepMind and optimized to run efficiently across NVIDIA platforms, introduces a new approach to text generation: it produces tokens in parallel rather than one at a time, enabling faster, higher-throughput AI applications. The model uses diffusion-based denoising to generate 256 tokens in parallel per step. It delivers up to 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU and up to 150 tokens/sec on NVIDIA DGX Spark, with the fastest local performance on NVIDIA DGX Station.
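To make the contrast with token-by-token decoding concrete, the following is a rough sketch that only counts forward passes. It assumes a simplified model in which output is produced in blocks of 256 tokens and each block needs a fixed number of denoising passes. The function names and the `steps_per_block` value are hypothetical and chosen for illustration; the actual DiffusionGemma sampling schedule is not described here.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one forward pass per generated token."""
    return num_tokens


def parallel_denoising_passes(
    num_tokens: int,
    block_size: int = 256,      # tokens refined in parallel per step (from the article)
    steps_per_block: int = 16,  # hypothetical value, not a published figure
) -> int:
    """Block-parallel decoding: each pass refines a whole block of tokens."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    n = 1024
    ar = autoregressive_passes(n)
    par = parallel_denoising_passes(n)
    print(f"{n} tokens: {ar} autoregressive passes vs {par} parallel denoising passes")
```

A pass over a 256-token block costs more compute than a single-token pass, so the ratio of pass counts is not a direct speedup figure. It illustrates why fewer, wider passes can help when each pass is dominated by reading model weights from memory.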

For enterprise developers, this speed translates into lower serving costs, higher concurrency, and more responsive user experiences without sacrificing model quality. DiffusionGemma is built on the Gemma 4 26B A4B mixture-of-experts (MoE) architecture and optimized for low-latency, memory-bound inference.

Model name: DiffusionGemma
Supported modalities: Text, image
Total parameters: 25.2B
Active parameters: 3.8B
Context length: Up to 256K tokens
Precision format: BF16, NVFP4

Table 1. Overview of DiffusionGemma, summarizing modalities, parameter sizes, supported context length, and precision formats
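The parameter counts and precision formats in Table 1 allow a back-of-the-envelope estimate of weight memory. The sketch below is an approximation under stated assumptions: BF16 is taken as 16 bits per weight, and NVFP4 is assumed to average about 4.5 bits per weight once block scaling factors are included. It also assumes every weight is stored in the chosen format and ignores activations, caches, and runtime overhead, so real footprints will differ.

```python
TOTAL_PARAMS = 25.2e9   # total parameters, from Table 1
ACTIVE_PARAMS = 3.8e9   # active parameters, from Table 1

# Assumed average storage cost per weight, in bits.
# The NVFP4 figure (4-bit values plus scale overhead) is an approximation.
BITS_PER_WEIGHT = {"BF16": 16.0, "NVFP4": 4.5}


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for fmt, bits in BITS_PER_WEIGHT.items():
        total_gb = weight_gigabytes(TOTAL_PARAMS, bits)
        active_gb = weight_gigabytes(ACTIVE_PARAMS, bits)
        print(f"{fmt}: ~{total_gb:.1f} GB total weights, ~{active_gb:.1f} GB active")
```

Under these assumptions, the full weights must fit in memory, while the smaller active set gives a rough sense of how much weight data the MoE design touches per token, which matters for memory-bound inference.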

