Gemma 4 12B: Google's encoder-free multimodal AI now runs on a laptop

Google shipped Gemma 4 12B this week, a model that delivers performance close to the larger Gemma 4 26B yet runs on a consumer laptop with 16GB of RAM or unified memory. That alone would be notable. But the more significant move is the architecture: no separate multimodal encoders at all. Vision and audio go straight into the LLM backbone.

"Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs." — Google DeepMind

What actually changed

Encoder-free multimodal: Traditional multimodal models pipe images and audio through separate encoder networks before the LLM ever sees them. Gemma 4 12B removes those entirely. Vision gets only a lightweight embedding module (a single matrix multiplication plus a positional embedding). Audio skips encoding altogether: the raw signal is projected directly into the same token space as text.
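
To make the idea concrete, here is a minimal sketch of what "a single matrix multiplication plus a positional embedding" can look like, written in plain Python with toy sizes. It is not Google's implementation: every dimension, the random weights, the helper names (matmul, embed_patches, frame_audio), and the choice to cut the waveform into fixed-size frames before projecting it are assumptions made purely for illustration.

```python
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def embed_patches(patches, projection, positions):
    """Project flattened image patches with one matmul, then add positional embeddings."""
    projected = matmul(patches, projection)
    return [[p + q for p, q in zip(tok, pos)] for tok, pos in zip(projected, positions)]


def frame_audio(samples, frame_size):
    """Split a raw waveform into fixed-size frames, dropping any ragged tail."""
    usable = len(samples) - len(samples) % frame_size
    return [samples[i:i + frame_size] for i in range(0, usable, frame_size)]


def rand_matrix(rows, cols):
    """Random stand-in for learned weights (hypothetical values)."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
D_MODEL = 8       # hypothetical token width shared by text, vision, and audio
PATCH_DIM = 12    # hypothetical: 2x2 pixel patch with 3 channels, flattened
NUM_PATCHES = 4   # hypothetical number of patches in one tiny image
FRAME_SIZE = 4    # hypothetical number of samples per audio frame

vision_proj = rand_matrix(PATCH_DIM, D_MODEL)
vision_pos = rand_matrix(NUM_PATCHES, D_MODEL)
audio_proj = rand_matrix(FRAME_SIZE, D_MODEL)

patches = rand_matrix(NUM_PATCHES, PATCH_DIM)             # fake image patches
waveform = [random.uniform(-1, 1) for _ in range(10)]     # fake raw audio samples

# Vision: one matmul plus positions. Audio: raw frames through one projection.
vision_tokens = embed_patches(patches, vision_proj, vision_pos)
audio_tokens = matmul(frame_audio(waveform, FRAME_SIZE), audio_proj)

print(len(vision_tokens), len(vision_tokens[0]))
print(len(audio_tokens), len(audio_tokens[0]))
```

Both outputs end up as rows of the same width, which is the point: once image patches and audio frames are vectors in the text token space, the backbone can treat them like any other tokens.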

Near-26B benchmark performance at half the footprint: On standard benchmarks it runs neck-and-neck with Gemma 4 26B, and actually surpasses it on DocVQA (document visual question answering).
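
A rough back-of-the-envelope check of the "half the footprint" claim, as a sketch only. It assumes the model names reflect nominal parameter counts (12 billion and 26 billion) and counts weights alone, ignoring activations, the KV cache, and runtime overhead. The bit widths are illustrative; the article does not say which precision the 16GB figure refers to, and weight_gib is a hypothetical helper.

```python
def weight_gib(params_billions, bits_per_weight):
    """Weights-only size in GiB; ignores activations, KV cache, and runtime overhead."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal parameter counts taken from the model names (an assumption).
for name, params in (("Gemma 4 12B", 12), ("Gemma 4 26B", 26)):
    for bits in (16, 8, 4):  # illustrative precisions
        print(f"{name} at {bits}-bit: {weight_gib(params, bits):.1f} GiB")

# The ratio is independent of precision: 12 / 26, a little under one half.
ratio = weight_gib(12, 16) / weight_gib(26, 16)
print(f"12B / 26B weight ratio: {ratio:.2f}")
```

Under these assumptions the 12B weights come to a little under half the size of the 26B weights at any given precision, and whether they fit in 16GB depends on the precision they are stored at.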
