Google Just Shipped an Encoder-Free Multimodal Model That Runs on Your Laptop

Google dropped Gemma 4 12B yesterday. It hit #1 on Hacker News within hours, and the reason isn't just "another model release." The architecture is genuinely different from anything else in the 10-15B parameter range.

Traditional multimodal models use a separate encoder for each input type. Vision goes through ViT or CLIP. Audio runs through Whisper or HuBERT. The encoded representations then feed into the LLM backbone. It works, but it's wasteful: every encoder adds memory overhead and inference latency.

Gemma 4 12B throws all the encoders away.

Traditional multimodal models (top) rely on separate encoders for vision and audio. Gemma 4 12B (bottom) feeds raw inputs directly into the LLM backbone.
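To make the memory argument concrete, here is a rough back-of-the-envelope sketch, not Google's numbers. It takes the "12B" in the model name at face value for the backbone, and the encoder sizes are hypothetical placeholders chosen only for illustration. The `Component` class and `weight_memory_gb` helper are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    params_millions: int  # placeholder size, not a published figure


def weight_memory_gb(components, bytes_per_param=2):
    """Memory needed just to hold the weights (2 bytes/param = 16-bit)."""
    total_params = sum(c.params_millions for c in components) * 1_000_000
    return total_params * bytes_per_param / 1e9


# Assumption: backbone size read off the "12B" in the model name.
backbone = Component("llm_backbone", 12_000)
# Hypothetical encoder sizes, for illustration only.
vision_encoder = Component("vision_encoder", 400)
audio_encoder = Component("audio_encoder", 600)

# Traditional layout: every encoder is loaded alongside the backbone.
encoder_based = [vision_encoder, audio_encoder, backbone]
# Encoder-free layout: only the backbone holds weights.
encoder_free = [backbone]

overhead_gb = weight_memory_gb(encoder_based) - weight_memory_gb(encoder_free)
print(f"encoder-based: {weight_memory_gb(encoder_based):.1f} GB")
print(f"encoder-free:  {weight_memory_gb(encoder_free):.1f} GB")
print(f"encoder overhead: {overhead_gb:.1f} GB")
```

The sketch counts weight storage only. Activations, the KV cache, and the extra forward pass each encoder runs before the backbone sees any tokens would come on top of that, and that extra pass is where the latency cost comes from.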
