TL;DR

Zyphra has released Zamba2-VL (1.2B–7B), a hybrid Mamba2–Transformer VLM with roughly 10x lower time-to-first-token (TTFT) than Transformer models. Its linear-time SSM design enables on-device inference and document and video processing without the quadratic attention and KV-cache overhead of a pure Transformer; target uses include edge assistants and invoice parsing.

Zyphra has released Zamba2-VL, a family of open vision-language models. The release covers three sizes: 1.2B, 2.7B, and 7B parameters. Each model is built on the Zamba2 hybrid SSM–Transformer backbone.

Vision-language models (VLMs) read images and text together. They answer questions about charts, documents, and photos. Most open VLMs use a dense Transformer as the language model. Zamba2-VL replaces that with a hybrid state-space design. The goal is competitive accuracy at lower latency.

What is Zamba2-VL

Zamba2-VL follows the now-standard LLaVA-style VLM template. A pre-trained vision encoder turns image patches into features. A lightweight MLP adapter projects those features into the language model’s embedding space. The language model then reads an interleaved sequence of vision and text tokens. The models support single-image and multi-image understanding and grounding.
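The interleaving step in that template can be illustrated with a minimal sketch. It is not Zyphra's code: the `IMAGE_PLACEHOLDER` sentinel, the `interleave` function, and the toy embedding table are all hypothetical names chosen for illustration. The sketch assumes each image has already been encoded and projected into a list of vectors of the same width as the text embeddings.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical sentinel id marking where an image appears in the prompt.
IMAGE_PLACEHOLDER = -1


def interleave(
    text_ids: Sequence[int],
    text_embed: Dict[int, Vector],
    image_features: Sequence[List[Vector]],
) -> List[Vector]:
    """Build the language-model input sequence.

    Text token ids are looked up in the embedding table. Each placeholder
    is replaced by all projected feature vectors of the next image, so
    vision and text tokens end up in one ordered sequence.
    """
    images = iter(image_features)
    sequence: List[Vector] = []
    for tok in text_ids:
        if tok == IMAGE_PLACEHOLDER:
            feats = next(images, None)
            if feats is None:
                raise ValueError("more image placeholders than images")
            sequence.extend(feats)
        else:
            sequence.append(text_embed[tok])
    return sequence


# Toy usage: two text tokens around one image that produced two vision tokens.
toy_embed = {0: [0.0, 0.0], 1: [1.0, 1.0]}
toy_image = [[9.0, 9.0], [8.0, 8.0]]
toy_sequence = interleave([0, IMAGE_PLACEHOLDER, 1], toy_embed, [toy_image])
```

With several placeholders in `text_ids`, the same function covers the multi-image case, since images are consumed in the order their placeholders appear.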

Zyphra pairs each Zamba2 backbone with the Vision Transformer from Qwen2.5-VL. That encoder was chosen for two specific properties: 2D rotary position embeddings and native dynamic-resolution processing. A two-layer MLP adapter connects the encoder to the backbone.
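As a rough sketch of what a two-layer MLP adapter does, the pure-Python example below projects per-patch vision features into a language-model embedding width. The article only states that the adapter has two layers; the GELU activation, the dimensions, the random initialization, and the `TwoLayerAdapter` name are assumptions made for illustration, not details of the Zamba2-VL release.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def gelu(v: float) -> float:
    """Exact GELU; the choice of activation is an assumption here."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


class TwoLayerAdapter:
    """Hypothetical adapter: Linear -> GELU -> Linear, applied per patch."""

    def __init__(self, vision_dim: int, hidden_dim: int, lm_dim: int,
                 seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> Matrix:
            scale = 1.0 / math.sqrt(cols)
            return [[rng.uniform(-scale, scale) for _ in range(cols)]
                    for _ in range(rows)]

        self.w1, self.b1 = init(hidden_dim, vision_dim), [0.0] * hidden_dim
        self.w2, self.b2 = init(lm_dim, hidden_dim), [0.0] * lm_dim

    def __call__(self, patch_features: List[Vector]) -> List[Vector]:
        projected = []
        for x in patch_features:
            hidden = [gelu(v) for v in linear(x, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


# Toy usage with made-up sizes: 5 patches of width 8 mapped to width 12.
adapter = TwoLayerAdapter(vision_dim=8, hidden_dim=16, lm_dim=12)
patches = [[0.1] * 8 for _ in range(5)]
vision_tokens = adapter(patches)
```

The adapter leaves the number of patches unchanged and alters only the width of each vector, which is what lets the backbone treat the resulting vision tokens like ordinary token embeddings.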

marktechpost.com

Zyphra Release Zamba2-VL: Hybrid Mamba2–Transformer Vision-Language Models That Cut Time-to-First-Token by About an Order of Magnitude

Zyphra released Zamba2-VL, open hybrid SSM–Transformer VLMs at 1.2B, 2.7B, and 7B with lower time-to-first-token than Transformers.

Friday, June 12, 2026
