NEO-unify: Building Native Multimodal Unified Models End to End


Existing Multimodal AI Dilemma

For years, multimodal AI has typically paired a vision encoder (VE) for perception with a variational autoencoder (VAE) for generation. Recent efforts seek to unify both with a shared tokenizer, but often with trade-offs. We return to first principles: building a model that engages directly with native inputs, namely pixels and words.

Today, SenseTime, in collaboration with NTU, introduces NEO-unify (preview), a native, unified, end-to-end paradigm that steps beyond representation arguments and breaks free from pre-trained priors and scaling-law bottlenecks. No VE! No VAE!

NEO-unify: End-to-End Native Unified Model Paradigm

