Multimodal AI models are supposed to handle ever-longer documents, but how they're trained to do so usually stays a trade secret. A new study shows that character recognition as a training task actually hurts performance and that question-answer pairs work far better.
Researchers from ByteDance Seed and the Hong Kong University of Science and Technology (HKUST) studied how vision-language models can be trained efficiently on long documents. The result is a model called MMProLong, built on Alibaba's open Qwen2.5-VL, that beats much larger competitors.
Modern multimodal AI models need to handle increasingly long inputs: entire PDF collections of rendered pages, hours of video, or agents that remember their tasks across many steps. AI labs like OpenAI, Google, and Alibaba tout context windows of up to 1 million tokens, capable of holding not just text but thousands of page images or video frames. But according to the authors, technical reports barely reveal what data a model should see and in what mix.
Asking questions teaches more than transcribing text
At first glance, the study's central finding seems obvious. For a multimodal model to learn to find the right spot in a 100-page document, having it transcribe the text of every page barely helps. It's more effective to ask questions whose answers are buried somewhere in those pages.
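To make the contrast concrete, here is a minimal Python sketch of the two kinds of training sample described above. It is an illustration only, not the authors' data pipeline: the `Page` and `Sample` classes, the function names, the prompts, and the toy 100-page document are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    text: str  # ground-truth text of the rendered page image (hypothetical field)


@dataclass
class Sample:
    pages: list[int]  # page images shown to the model
    prompt: str
    target: str       # what the model is trained to output


def transcription_sample(doc: list[Page]) -> Sample:
    """Transcription-style task: reproduce the text of every page."""
    return Sample(
        pages=[p.number for p in doc],
        prompt="Transcribe the text of every page.",
        target="\n".join(p.text for p in doc),
    )


def qa_sample(doc: list[Page], question: str, answer: str, evidence_page: int) -> Sample:
    """QA-style task: the whole document is the input, the answer sits on one page."""
    if not any(p.number == evidence_page for p in doc):
        raise ValueError("evidence page is not part of the document")
    return Sample(pages=[p.number for p in doc], prompt=question, target=answer)


# Toy 100-page document with one relevant fact buried on page 42 (made-up content).
doc = [Page(i, f"Filler text for page {i}.") for i in range(1, 101)]
doc[41] = Page(42, "The warranty period is 24 months.")

transcription = transcription_sample(doc)
qa = qa_sample(doc, "How long is the warranty period?", "24 months", evidence_page=42)
```

Both samples show the model the same 100 pages. The transcription target only asks it to copy what is on each page, while the question-answer target can only be produced by locating page 42 among the other 99, which is the skill the study says long-document training should exercise.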