Google released an experimental model with open weights that generates text through diffusion instead of word by word. On a single GPU, it runs up to four times faster in single-user mode than classic language models. Nvidia handled the optimization.
Most language models generate one token after another, basing each new token on the previous one. DiffusionGemma takes a different approach. It starts with a block of 256 random placeholder tokens and refines them across several passes until readable text emerges. The idea comes from image AI, where diffusion models turn noise into clear images.
The model has 26 billion parameters total but only activates 3.8 billion per step. That's thanks to a mixture-of-experts architecture, where several specialized sub-networks sit side by side and only the right ones fire depending on the input. When quantized to lower precision, the model fits into 18 GB of VRAM on high-end consumer GPUs, according to Google. It builds on the Gemma 4 family and borrows its diffusion process from Google's earlier research on Gemini Diffusion.
DiffusionGemma generates far more tokens per second than the autoregressive Gemma 4 models but scores slightly lower on accuracy. | Image: Google










