Google shipped Gemma 4 12B this week, a model that packs performance close to the larger Gemma 4 26B into something that runs on a consumer laptop with 16GB of RAM or unified memory. That alone would be notable. But the more significant move is the architecture: it has no separate multimodal encoders at all. Vision and audio inputs go straight into the LLM backbone.

"Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs." — Google DeepMind

What actually changed

Encoder-free multimodal: Traditional multimodal models pipe images and audio through separate encoder networks before the LLM ever sees them. Gemma 4 12B removes those networks entirely. Vision gets only a lightweight embedding module (a single matrix multiplication plus a positional embedding). Audio has no dedicated encoder either: the raw signal is projected directly into the same token space as text.
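To make the encoder-free idea concrete, here is a minimal NumPy sketch of what "a single matrix multiplication plus a positional embedding" for vision, and a direct linear projection for audio, could look like. It is an illustration under stated assumptions, not Gemma's actual implementation: the patch size, frame length, model width, random weights, and the function names `embed_image` and `embed_audio` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only to keep the example small.
D_MODEL = 64   # width of the shared token space
PATCH = 16     # image patch side length in pixels
FRAME = 400    # audio frame length in samples


def embed_image(image, w_patch, pos_emb):
    """Cut an (H, W, C) image into patches, project each with one matmul, add positions."""
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH
    patches = (
        image[: gh * PATCH, : gw * PATCH]
        .reshape(gh, PATCH, gw, PATCH, c)
        .transpose(0, 2, 1, 3, 4)              # group the pixels of each patch together
        .reshape(gh * gw, PATCH * PATCH * c)   # one flat row per patch
    )
    return patches @ w_patch + pos_emb[: gh * gw]


def embed_audio(waveform, w_audio):
    """Cut a raw 1-D waveform into fixed frames and project each frame linearly."""
    n_frames = len(waveform) // FRAME
    frames = waveform[: n_frames * FRAME].reshape(n_frames, FRAME)
    return frames @ w_audio


# Random stand-ins for real inputs and learned weights.
image = rng.random((64, 48, 3))
waveform = rng.standard_normal(16000)
text_tokens = rng.standard_normal((5, D_MODEL))

w_patch = rng.standard_normal((PATCH * PATCH * 3, D_MODEL)) * 0.02
pos_emb = rng.standard_normal((12, D_MODEL)) * 0.02
w_audio = rng.standard_normal((FRAME, D_MODEL)) * 0.02

vision_tokens = embed_image(image, w_patch, pos_emb)   # shape (12, 64)
audio_tokens = embed_audio(waveform, w_audio)          # shape (40, 64)

# All three modalities now share one width and can be fed to the backbone as one sequence.
sequence = np.concatenate([vision_tokens, audio_tokens, text_tokens], axis=0)
print(sequence.shape)
```

The point of the sketch is the absence of any stacked encoder layers: each modality reaches the shared token space through a single learned projection, so the backbone has to do the work a separate encoder would otherwise handle.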

Near-26B benchmark performance at roughly half the footprint: On standard benchmarks, Gemma 4 12B runs neck-and-neck with Gemma 4 26B, and it surpasses the larger model on DocVQA (document visual question answering).
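A rough back-of-the-envelope calculation shows why the footprint matters on a 16GB machine. The sketch below assumes the nominal parameter counts implied by the model names (12 billion and 26 billion) and counts weight storage only; it ignores activations, the KV cache, and runtime overhead, and the precisions shown are illustrative rather than anything the announcement specifies. The helper `weight_memory_gb` is a hypothetical name.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts taken from the model names (an assumption).
models = [("Gemma 4 12B", 12e9), ("Gemma 4 26B", 26e9)]

for name, n_params in models:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.0f} GB of weights")
```

Under these assumptions the 12B weights come to about 24 GB at 16-bit, 12 GB at 8-bit, and 6 GB at 4-bit, while the 26B weights come to about 52 GB, 26 GB, and 13 GB. A 16GB laptop therefore has headroom for the smaller model once it is quantized, whereas the larger one leaves little or none.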