This article is part of our coverage of the latest in AI research.
Google just made a major upgrade to its Gemma 4 family of open-weight LLMs that significantly improves the inference speed of the models.
By integrating multi-token prediction (MTP) into its architecture, Gemma 4 breaks away from the traditional one-token-at-a-time approach, increasing throughput on consumer-grade hardware.
The efficiency of an LLM is often constrained by memory bandwidth rather than raw computational power. In standard autoregressive generation, a model predicts the next token, appends it to the sequence, and repeats the process. This requires moving the model’s massive weight matrices from memory to the processor for every single generated token. On local devices such as MacBooks or PCs with limited memory bandwidth, this “memory wall” creates a hard limit on how fast the model can respond, regardless of the GPU’s clock speed.
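To make the memory wall concrete, here is a rough back-of-the-envelope sketch. It assumes that every generated token requires reading all model weights from memory once and ignores caches, activations, and compute time. The function name and the numbers in the example (an 8-billion-parameter model, 4-bit weights, 100 GB/s of bandwidth) are illustrative assumptions, not Gemma 4 specifications.

```python
def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decoding speed when memory bandwidth is the bottleneck.

    Assumption: each token requires streaming the full set of weights
    from memory exactly once, and nothing else limits throughput.
    """
    # Model size in GB: billions of parameters times bytes per parameter.
    model_size_gb = params_billion * bytes_per_param
    # Each token costs one full pass over the weights.
    return bandwidth_gb_s / model_size_gb


# Hypothetical example: 8B parameters, 4-bit weights (0.5 bytes each),
# 100 GB/s of memory bandwidth.
print(max_tokens_per_second(8, 0.5, 100))
```

Under these assumptions, a faster GPU clock does not raise the ceiling; only a smaller model, lower-precision weights, or more bandwidth does.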
Multi-token prediction changes this dynamic by predicting several tokens in parallel. Instead of asking the model “What is the next token?”, MTP asks “What are the next n tokens?” By predicting multiple tokens at once, Gemma 4 makes better use of the parallel processing capabilities of modern GPUs and reduces the number of times the model weights must be fetched from memory. The result is a direct performance boost for the end user: local interactions feel faster and more fluid.
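The saving in weight fetches can be sketched with simple arithmetic. The snippet below is an idealized illustration, not Gemma 4's actual implementation: it assumes every pass over the weights yields exactly n usable tokens, whereas in practice fewer tokens may be kept per pass. The function name and the example values are hypothetical.

```python
import math


def weight_fetches(total_tokens: int, tokens_per_pass: int) -> int:
    """Number of full passes over the model weights needed to emit
    `total_tokens`, assuming each pass yields `tokens_per_pass` tokens.

    Assumption: all predicted tokens are kept (idealized best case).
    """
    return math.ceil(total_tokens / tokens_per_pass)


# Hypothetical 256-token response.
one_at_a_time = weight_fetches(256, 1)   # standard autoregressive decoding
four_at_a_time = weight_fetches(256, 4)  # idealized MTP with n = 4

print(one_at_a_time, four_at_a_time)
print(one_at_a_time / four_at_a_time)    # idealized reduction in weight fetches
```

Because decoding on local hardware is bound by how often the weights are read, fewer passes translate fairly directly into lower latency in this simplified model.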






