T*: Rethinking Temporal Search for Long-Form Video Understanding

Most video understanding models drown in data at inference time. Imagine watching a 60-minute security video just to answer: “Who entered the room after the clock rang?” Many models still scan frame by frame, processing thousands of irrelevant images to find a few useful moments. That is slow, expensive, and fundamentally inefficient. What if AI could search instead of seeing everything?

This question motivated us to rethink how vision-language models (VLMs) approach long-form video. In our paper “T*: Re-thinking Temporal Search for Long-Form Video Understanding” (CVPR 2025, arXiv:2504.02259), we introduce a simple yet powerful idea: before analyzing details, first find the few frames that matter.

The “Long Video Haystack” Problem

To formalize this challenge, we introduce the Long Video Haystack problem: given a long video and a question, locate the minimal set of frames, usually just one to five, that are sufficient to answer it. This mirrors how humans operate: we fast-forward, skim, and search for visual cues.
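To make the problem statement concrete, here is a minimal sketch of the task's input and output: a long video and a question go in, a handful of frame indices come out. It is a naive score-and-rank baseline, not the T* method. The names `search_keyframes`, `score_fn`, and `stride` are hypothetical, and the toy scorer simply counts shared words between a caption and the question.

```python
from typing import Callable, List, Sequence


def search_keyframes(
    frames: Sequence,
    question: str,
    score_fn: Callable[[object, str], float],
    k: int = 5,
    stride: int = 1,
) -> List[int]:
    """Return up to k frame indices with the highest relevance, in temporal order.

    score_fn is an assumed, user-supplied relevance function (for example an
    image-text similarity model); it is not part of the original post.
    """
    if k <= 0 or stride <= 0:
        return []
    # Score only every `stride`-th frame to limit the number of model calls.
    scored = [(score_fn(frames[i], question), i) for i in range(0, len(frames), stride)]
    # Highest score first; ties are broken in favor of the earlier frame.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(index for _, index in scored[:k])


def toy_score(caption: str, question: str) -> float:
    """Toy relevance: number of caption words that also occur in the question."""
    question_words = set(question.lower().split())
    return float(sum(word in question_words for word in caption.lower().split()))


# Captions stand in for real frames in this toy example.
captions = ["empty room", "clock rings", "person enters room", "empty room"]
selected = search_keyframes(captions, "who enters after the clock rings", toy_score, k=2)
print(selected)
```

Scoring every frame, as this baseline does when `stride=1`, is exactly the cost the post argues against; a real temporal search would try to reach the same few indices while examining far fewer frames.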

We develop LV-HAYSTACK, the first large-scale benchmark dataset for temporal search in long videos. Spanning 480 hours of egocentric and allocentric video, the dataset contains 15,092 human-annotated QA pairs. Each instance includes a real-world video, a question, and a small set of human-annotated keyframes that answer it.
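One way to picture a single benchmark instance is as a small record holding the video, the question, and the annotated keyframes. The sketch below assumes hypothetical field names (`video_path`, `question`, `keyframes`) and a simple tolerance-based recall; neither is taken from the released dataset schema or its official metrics.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class HaystackInstance:
    """Hypothetical container for one (video, question, keyframes) instance."""
    video_path: str
    question: str
    keyframes: List[int]  # human-annotated frame indices that answer the question


def keyframe_recall(predicted: Sequence[int], annotated: Sequence[int], tolerance: int = 0) -> float:
    """Fraction of annotated keyframes with a predicted frame within `tolerance` frames.

    This is an illustrative measure, not the benchmark's own evaluation protocol.
    """
    if not annotated:
        return 0.0
    hits = sum(
        1 for target in annotated
        if any(abs(target - guess) <= tolerance for guess in predicted)
    )
    return hits / len(annotated)


instance = HaystackInstance(
    video_path="example_video.mp4",  # placeholder path
    question="Who entered the room after the clock rang?",
    keyframes=[100, 250],
)
recall = keyframe_recall([98, 400], instance.keyframes, tolerance=5)
print(recall)
```

The tolerance reflects that neighboring frames in a video are often near-duplicates, so a prediction a few frames away from an annotated keyframe may be just as useful.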
