ByteDance study finds that asking LMMs questions beats making them transcribe text for long-document training

ByteDance Seed shows that a 7B model can answer questions on long, image-heavy documents more reliably than much larger models, even when documents are four times longer than anything it saw during training. Instead of transcribing pages, the model learns by answering questions and finding the right passages on its own.

Sunday, May 24, 2026

Multimodal AI models are supposed to handle ever-longer documents, but how they're trained to do so usually stays a trade secret. A new study shows that character recognition as a training task actually hurts performance and that question-answer pairs work far better.

Researchers from ByteDance Seed and the Hong Kong University of Science and Technology (HKUST) studied how image-language models can be trained efficiently on long documents. The result is a model called MMProLong, built on Alibaba's open Qwen2.5-VL, that beats much larger competitors.

Modern multimodal AI models need to handle increasingly long inputs: entire PDF collections of rendered pages, hours of video, or agents that remember their tasks across many steps. AI labs like OpenAI, Google, and Alibaba tout context windows of up to 1 million tokens, capable of holding not just text but thousands of page images or video frames. But according to the authors, technical reports barely reveal what data a model should see and in what mix.

Asking questions teaches more than transcribing text

At first glance, the study's central finding seems obvious. For a multimodal model to learn to find the right spot in a 100-page document, having it transcribe the text of every page barely helps. It's more effective to ask questions whose answers are buried somewhere in those pages.
