Google brings multi-token prediction to Gemma 4 LLMs - TechTalks

This article is part of our coverage of the latest in AI research.

Google just made a major upgrade to its Gemma 4 family of open-weight LLMs that significantly improves the inference speed of the models.

By integrating multi-token prediction (MTP) into its architecture, Gemma 4 breaks away from the traditional one-token-at-a-time approach, increasing throughput on consumer-grade hardware.

The inference speed of an LLM is often constrained by memory bandwidth rather than raw computational power. In standard autoregressive generation, a model predicts the next token, appends it to the sequence, and repeats the process. This requires moving the model’s massive weight matrices from memory to the processor for every single token. On local devices such as MacBooks or PCs with limited memory bandwidth, this “memory wall” caps how fast the model can respond, regardless of how much compute the GPU has to spare.
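The memory wall can be made concrete with a back-of-the-envelope bound. The sketch below assumes that every generated token requires reading all of the weights from memory exactly once, and that nothing else limits speed. The function name and the numbers (12 billion parameters, 2 bytes per weight, 100 GB/s of bandwidth) are illustrative assumptions, not figures reported for Gemma 4.

```python
def max_decode_tokens_per_second(
    num_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on decode speed if each token reads all weights once.

    Hypothetical helper for illustration: it ignores caches, activations,
    the KV cache, and compute time, so real throughput will be lower.
    """
    model_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Assumed, illustrative hardware and model numbers (not from the article).
bound = max_decode_tokens_per_second(
    num_params=12e9,             # 12 billion weights
    bytes_per_param=2.0,         # 16-bit weights
    bandwidth_bytes_per_s=100e9  # 100 GB/s memory bandwidth
)
print(f"{bound:.2f} tokens/s")  # prints "4.17 tokens/s"
```

Under these assumptions, a faster processor would not raise the bound; only more bandwidth, smaller weights, or more tokens per weight read would.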

Multi-token prediction changes this dynamic by predicting several tokens in parallel. Instead of asking the model “What is the next token?” MTP asks “What are the next n tokens?” By predicting multiple tokens at once, Gemma 4 makes better use of the parallel processing capabilities of modern GPUs and reduces the number of times the model weights must be fetched from memory for a given response. The result is a direct performance boost for the end user, making local interactions feel faster and more fluid.
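The effect on weight fetches can be sketched with simple counting. The example below is an idealized model, not Gemma 4's actual decoding loop: it assumes each pass over the weights yields exactly n usable tokens, whereas in practice some predicted tokens may be discarded. The function name `weight_passes` is hypothetical.

```python
import math


def weight_passes(num_tokens: int, tokens_per_pass: int) -> int:
    """Number of full passes over the weights needed to emit num_tokens.

    Idealized assumption: every pass produces exactly tokens_per_pass
    tokens that are all kept.
    """
    if num_tokens < 0 or tokens_per_pass < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_pass >= 1")
    return math.ceil(num_tokens / tokens_per_pass)


# Compare one-token-at-a-time decoding (n = 1) with n = 2 and n = 4
# for a 256-token response.
for n in (1, 2, 4):
    print(n, weight_passes(256, n))
# prints:
# 1 256
# 2 128
# 4 64
```

In a memory-bound setting, fewer passes over the weights translate roughly into proportionally less time spent waiting on memory, which is where the speedup comes from.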
