Most video understanding models drown in data at inference time. Imagine watching a 60-minute security video just to answer: “Who entered the room after the clock rang?” Many models still scan frame by frame, processing thousands of irrelevant images to find a few useful moments. That is slow, expensive, and fundamentally inefficient. What if AI could search instead of seeing everything?

This question motivated us to rethink how vision-language models (VLMs) approach long-form video. In our paper “T*: Re-thinking Temporal Search for Long-Form Video Understanding” (CVPR 2025, arXiv:2504.02259), we introduce a simple yet powerful idea: before analyzing details, first find the few frames that matter.

The “Long Video Haystack” Problem

To formalize this challenge, we introduce the Long Video Haystack problem: given a long video and a question, locate the minimal set of frames, usually just one to five, that are sufficient to answer it. This mirrors how humans operate: we fast-forward, skim, and search for visual cues.

We develop LV-HAYSTACK, the first large-scale benchmark dataset for temporal search in long videos. Spanning 480 hours of egocentric and allocentric video, the dataset contains 15,092 human-annotated QA pairs. Each instance includes a real-world video, a question, and a small set of human-annotated keyframes that answer it.
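
To make the shape of a benchmark instance concrete, here is a minimal sketch in Python. The class name, field names, file path, and timestamps are all hypothetical and are not the released LV-HAYSTACK schema. The `keyframe_recall` helper is likewise only an illustrative way to check predicted frames against annotated ones within a time tolerance; it is not the benchmark's official metric.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class HaystackInstance:
    """One hypothetical benchmark record: a video, a question, and keyframes."""
    video_path: str                    # assumed field: location of the video
    question: str                      # natural-language question about the video
    keyframes_sec: Tuple[float, ...]   # assumed field: annotated keyframe times (s)


def keyframe_recall(instance: HaystackInstance,
                    predicted_sec: Sequence[float],
                    tolerance_sec: float = 1.0) -> float:
    """Fraction of annotated keyframes that have a prediction within the tolerance."""
    if not instance.keyframes_sec:
        return 0.0
    hits = sum(
        any(abs(p - k) <= tolerance_sec for p in predicted_sec)
        for k in instance.keyframes_sec
    )
    return hits / len(instance.keyframes_sec)


# Made-up example values, for illustration only.
example = HaystackInstance(
    video_path="security_cam.mp4",
    question="Who entered the room after the clock rang?",
    keyframes_sec=(1805.0, 1812.5),
)
```

A temporal search method would take `example.video_path` and `example.question` and return a handful of timestamps; a call such as `keyframe_recall(example, [1805.4, 60.0])` then reports how many of the annotated moments it found.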