Existing Multimodal AI Dilemma

For years, multimodal AI has typically relied on a vision encoder (VE) to perceive and a variational autoencoder (VAE) to generate. Recent efforts seek to unify both with a shared tokenizer, but often with trade-offs. We return to first principles: building a model that directly engages with its native inputs, pixels and words.

Today, SenseTime, in collaboration with NTU, introduces a native, unified, end-to-end paradigm dubbed NEO-unify (preview), stepping beyond representation arguments and breaking free from pre-trained priors and scaling-law bottlenecks. No VE! No VAE!

NEO-unify: End-to-End Native Unified Model Paradigm