Published on June 27, 2026

Abstract

As multimodal vision-language models (VLMs) move into real-world cameras, drones, and embodied robots, it is no longer enough to ask only how "intelligent" a model is. A robot must not only understand "what is in the image" but also know precisely "where it is."

Although today's mainstream VLMs perform well at high-level scene understanding, they often struggle with fine-grained perception tasks that require accurate localization. To address this limitation, we introduce VLX-Seek, an efficient inference model designed for on-device embodied vision. VLX-Seek pushes VLM capabilities beyond "understanding what they see" toward precise localization.