Google Releases DiffusionGemma: Parallel Block Decoding

What: Google released DiffusionGemma, an open-weight model whose headline trick is parallel block decoding — it writes text by refining a whole block of tokens at once through iterative denoising, instead of predicting one next token at a time.

Why: Decoding is the slow, sequential part of running an LLM: emitting N tokens normally costs N forward passes that each wait on the last. Laying down a block in parallel is why DiffusionGemma reports up to 4x faster decode and 1000+ tokens/sec on an H100.

vs prior: Versus standard autoregressive decoding — left-to-right, one token per forward pass under a causal mask — DiffusionGemma starts from a canvas of 256 placeholder tokens and refines them all at once with bidirectional attention, so it can revise an early token using later context.

Think of it as

a Polaroid photo developing all at once vs a printer typing left to right

Google Releases DiffusionGemma: Parallel Block Decoding

Other newsrooms on this story

Related reading

Google unveils DiffusionGemma, an AI model that breaks free of left-to-right…

Other newsrooms on this story

Related reading

Google unveils DiffusionGemma, an AI model that breaks free of left-to-right…

Google launches DiffusionGemma open model for faster local AI workflows

DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes…

Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion…

Google's DiffusionGemma AI Hits 1,000 Tokens Per Second—And It's Free - Decrypt

Google's DiffusionGemma runs text 4x faster