TL;DR

Gemma 4 adds native vision to every variant using 48×48 soft tokens: nine 16×16 patch embeddings (a 3×3 grid) pooled into one, reducing the visual token count 9×. The compression cuts inference cost and context length, making high-resolution multimodal models more viable on edge and mobile hardware.

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

The road to vision capabilities in the Gemma family has been an interesting one. The first and second generations of Gemma models did not include native vision support. Instead, multimodal capabilities were introduced through the PaliGemma models.

It wasn’t until Gemma 3 that native vision capabilities were integrated directly into the Gemma family. Even then, those capabilities were reserved for the larger variants. With Gemma 4, that changes. Every variant in the series can now see.

Gemma 4’s vision system is also a significant step forward from Gemma 3. It introduces several new ideas and challenges the common approach of representing images as fixed 16×16 words. Instead, Gemma 4 processes images using 48×48 soft tokens, a design that fundamentally changes how visual information is represented within the model.
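To make the 9× figure concrete, here is a minimal sketch of the arithmetic, assuming a 48×48 soft token covers a 3×3 grid of 16×16 patches. The 768×768 image size, the function names, and the use of plain average pooling are illustrative assumptions, not details confirmed for Gemma 4's actual vision encoder.

```python
PATCH = 16  # pixel side of one classic ViT-style patch ("16x16 word")
POOL = 3    # patches per side merged into one soft token (3 * 16 = 48)


def token_counts(height, width, patch=PATCH, pool=POOL):
    """Return (patch_count, soft_token_count) for an image.

    Assumes both sides are exact multiples of patch * pool; real
    preprocessing (resizing, padding) is out of scope for this sketch.
    """
    block = patch * pool
    if height % block or width % block:
        raise ValueError("image sides must be multiples of %d" % block)
    patches = (height // patch) * (width // patch)
    soft_tokens = (height // block) * (width // block)
    return patches, soft_tokens


def average_pool(grid, pool=POOL):
    """Average each pool x pool block of patch embeddings into one vector.

    grid is a rows x cols nested list of embedding vectors (lists of floats).
    Average pooling is an assumption here; the article only says "pooled".
    """
    rows, cols, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for r in range(0, rows, pool):
        out_row = []
        for c in range(0, cols, pool):
            block = [grid[r + i][c + j] for i in range(pool) for j in range(pool)]
            out_row.append([sum(vec[d] for vec in block) / len(block) for d in range(dim)])
        pooled.append(out_row)
    return pooled


# Hypothetical 768x768 input: 48x48 patches collapse to 16x16 soft tokens.
patches, soft_tokens = token_counts(768, 768)
print(patches, soft_tokens, patches // soft_tokens)  # 2304 256 9
```

Whatever pooling operator the model really uses, the token-count ratio depends only on the 3×3 grouping, which is where the 9× reduction comes from.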

In this article, we’ll take a deep dive into how Gemma 4’s vision capabilities work. Along the way, we’ll explore the architectural decisions behind the model and build an intuition for why the Google DeepMind team made certain design choices for this release of Gemma.

dev.to

Gemma 4 Soft Tokens: The Rise and Fall of 16x16 Words ⚡👀

Monday, May 25, 2026

4,334 words · ~20 min read

Related reading

Your Laptop Just Got Smarter: A Complete Guide to Gemma 4's Four Models

Gemma 4 Is Not Just Another Open Model — It Changes What Developers Can Build…

Why Gemma 4 Feels Like an Important Moment for AI Developers✨

I Gave Gemma 4 150 Tools on Windows. Here's What Actually Happened.

I Fine-Tuned Gemma 4 on an Emotion Dataset Using a Single GPU

I was trying to Learning About Gemma 4 and It was pretty good
