DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

TL;DR: Google released DiffusionGemma, an open-weight, Apache 2.0-licensed, diffusion-based LLM that generates text up to 4x faster than autoregressive models. It hits 1,000+ tokens/sec on a single H100 and fits in 18 GB of VRAM, trading some accuracy for that speed. Here is what that means in practice.

What DiffusionGemma Actually Is

Google DeepMind released DiffusionGemma, the first production-grade open-weight model that applies discrete diffusion to text generation. It draws on the same family of techniques behind image generators like Stable Diffusion, now applied to language.

Instead of predicting one token at a time left-to-right, DiffusionGemma fills a 256-token block with noise and iteratively refines the entire block across multiple denoising passes until confidence thresholds are met. It commits roughly 15-20 tokens per forward pass on average, not one.
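To make the block-refinement loop concrete, here is a minimal Python sketch of confidence-thresholded parallel decoding. It is not DiffusionGemma's implementation: `toy_denoiser` is a hypothetical stand-in that returns random guesses instead of running a model, noisy positions are represented as simple placeholders, and the `threshold` value and the "always commit at least one position" rule are assumptions added so the loop is guaranteed to finish.

```python
import random

MASK = None  # placeholder for a position that has not been committed yet


def toy_denoiser(block, rng):
    """Hypothetical stand-in for one model forward pass.

    Returns a (token_id, confidence) guess for every position in the block.
    A real model would condition on the prompt and on already-committed tokens.
    """
    return [(rng.randrange(1000), rng.random()) for _ in block]


def denoise_block(block_size=256, threshold=0.93, seed=0):
    """Refine a whole block in parallel until every position is committed."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    passes = 0
    while any(tok is MASK for tok in block):
        guesses = toy_denoiser(block, rng)  # one forward pass over the block
        passes += 1
        open_positions = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit every open position whose confidence clears the threshold.
        confident = [i for i in open_positions if guesses[i][1] >= threshold]
        if not confident:
            # Assumed fallback: commit the single best guess so each pass makes progress.
            confident = [max(open_positions, key=lambda i: guesses[i][1])]
        for i in confident:
            block[i] = guesses[i][0]
    return block, passes


if __name__ == "__main__":
    tokens, n_passes = denoise_block()
    print(f"committed {len(tokens)} tokens in {n_passes} forward passes")
```

Because each pass can commit many positions at once, the number of forward passes stays below the number of tokens whenever the model is confident about more than one position at a time. An autoregressive decoder, by contrast, always needs one pass per token.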

This is a fundamentally different compute pattern from the one-token-per-pass autoregressive decoding used by most production LLMs today.
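A quick back-of-the-envelope calculation shows the size of that difference. The sketch below uses only the figures quoted above (a 256-token block and roughly 15-20 committed tokens per forward pass); the function name `passes_per_block` is just an illustrative helper.

```python
import math

BLOCK_SIZE = 256            # tokens per block, as quoted above
TOKENS_PER_PASS = (15, 20)  # average tokens committed per forward pass, as quoted above


def passes_per_block(block_size, tokens_per_pass):
    """Forward passes needed to commit a full block at a given average rate."""
    return math.ceil(block_size / tokens_per_pass)


autoregressive = passes_per_block(BLOCK_SIZE, 1)                    # one token per pass
diffusion_slow = passes_per_block(BLOCK_SIZE, TOKENS_PER_PASS[0])   # 15 tokens per pass
diffusion_fast = passes_per_block(BLOCK_SIZE, TOKENS_PER_PASS[1])   # 20 tokens per pass

print(autoregressive, diffusion_slow, diffusion_fast)  # prints: 256 18 13
```

Pass counts are not wall-clock time: each diffusion pass processes the whole block rather than a single new token, so the ratio of passes should not be read as a speedup figure.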
