In a previous article, we benchmarked three open-source Vision-Language Models (VLMs) on zero-shot object detection and arrived at an uncomfortable conclusion: even the fastest contender, Phi-3.5-vision-instruct, takes 4.45 seconds per frame on an NVIDIA L4. LLaVA-v1.6 sits at 8.13 seconds. For any application that needs to process a live video stream, these numbers are disqualifying. But the conclusion that VLMs are fundamentally incompatible with real-time workloads deserves more scrutiny. Those latencies were measured on a general-purpose zero-shot detection task, which asks the model to reason about arbitrary objects in arbitrary scenes. What happens when you constrain the problem? What happens when you give the model a closed vocabulary, a fixed resolution, a deterministic decoding strategy, and a non-blocking inference pipeline?

This article answers that question. Using PaliGemma, Google’s compact vision-language model, we built a real-time video classification system that runs at approximately 0.8 to 1.2 seconds per frame on an NVIDIA RTX A4500. Measured against LLaVA-v1.6's 8.13 seconds, that is roughly a seven- to tenfold speedup (8.13 / 1.2 ≈ 6.8 and 8.13 / 0.8 ≈ 10.2) on comparable professional hardware, achieved entirely through architectural decisions rather than hardware upgrades. Here are the four patterns that made it possible.