One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing

Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post-hoc.

The ByteDance research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

https://arxiv.org/pdf/2605.18678

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing — including multi-turn consistency editing across both modalities.
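One way to picture this taxonomy is as a lookup from task to output family. The sketch below is illustrative only and is not taken from the Lance codebase: the names `OutputFamily`, `TASK_FAMILY`, and `tasks_for` are hypothetical, and placing every understanding task under X2T is an assumption based on the description above. Subject-driven generation and multi-turn editing are left out because the description does not tie them to a single output family.

```python
from enum import Enum


class OutputFamily(Enum):
    """The three output families named in the article (hypothetical enum)."""
    X2T = "text"
    X2I = "image"
    X2V = "video"


# Assumed mapping: understanding tasks emit text, generation and editing
# tasks emit the modality they produce.
TASK_FAMILY = {
    "image captioning": OutputFamily.X2T,
    "video captioning": OutputFamily.X2T,
    "visual question answering": OutputFamily.X2T,
    "OCR": OutputFamily.X2T,
    "visual grounding": OutputFamily.X2T,
    "reasoning": OutputFamily.X2T,
    "text-to-image": OutputFamily.X2I,
    "image editing": OutputFamily.X2I,
    "text-to-video": OutputFamily.X2V,
    "image-to-video": OutputFamily.X2V,
    "video editing": OutputFamily.X2V,
}


def tasks_for(family: OutputFamily) -> list[str]:
    """Return the task names assigned to one output family, sorted by name."""
    return sorted(task for task, fam in TASK_FAMILY.items() if fam is family)


if __name__ == "__main__":
    for family in OutputFamily:
        print(family.name, "->", ", ".join(tasks_for(family)))
```

Grouping by what comes out rather than what goes in is what the "X2" prefix suggests: any supported mix of text, image, and video inputs can feed each family.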
