Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by splitting the two tasks into distinct architectures, then bridging them post hoc.

A ByteDance research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, with all three trained jointly from the start.

https://arxiv.org/pdf/2605.18678

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing — including multi-turn consistency editing across both modalities.