Google DeepMind just released Gemma 4 12B, a dense multimodal model that strips out traditional encoders entirely. Vision and audio flow straight into the LLM backbone. The result is a model that runs agentic workflows on a consumer laptop with 16 GB of RAM. It ships under the Apache 2.0 license.

Model Overview & Access

Gemma 4 12B is a 12-billion-parameter decoder-only transformer. It handles text, images, audio, and video natively. There are no separate vision or audio encoders. The decoder uses the same structure as the Gemma 4 31B Dense model. It bridges the gap between the edge-friendly E4B and the larger 26B Mixture of Experts variant.

Architecture: Unified, encoder-free decoder-only transformer.

Modalities: Text, image, video, and native audio input — the first mid-sized Gemma with audio.