This is a submission for the Gemma 4 Challenge: Write About Gemma 4

The road to vision capabilities in the Gemma family has been an interesting one. The first and second generations of Gemma models did not include native vision support. Instead, multimodal capabilities were introduced through the PaliGemma models.

It wasn’t until Gemma 3 that we began to see native vision capabilities integrated directly into the Gemma family. Even then, those capabilities were reserved for the larger variants. With Gemma 4, that changes. Every variant in the series can now see.

Gemma 4’s vision system is also a significant step forward from Gemma 3. It introduces several new ideas and challenges the common approach of representing images as fixed 16×16 words. Instead, Gemma 4 processes images using 48×48 soft tokens, a design that fundamentally changes how visual information is represented within the model.

In this article, we’ll take a deep dive into how Gemma 4’s vision capabilities work. Along the way, we’ll explore the architectural decisions behind the model and build an intuition for why the Google DeepMind team made certain design choices for this release of Gemma.