Real-time video classification with PaliGemma: architecture patterns for low-latency VLM inference

In a previous article, we benchmarked three open-source Vision-Language Models on zero-shot object detection and arrived at an uncomfortable conclusion: even the fastest contender, Phi-3.5-vision-instruct, takes 4.45 seconds per frame on an NVIDIA L4. LLaVA-v1.6 sits at 8.13 seconds. For any application that needs to process a live video stream, these numbers are disqualifying. But the conclusion that VLMs are fundamentally incompatible with real-time workloads deserves more scrutiny. That 8-second figure was measured on a general-purpose zero-shot detection task, one that asks the model to reason about arbitrary objects in arbitrary scenes. What happens when you constrain the problem? When you give the model a closed vocabulary, a fixed resolution, a deterministic decoding strategy, and a non-blocking inference pipeline?

This article answers that question. Using PaliGemma, Google’s compact vision-language model, we built a real-time video classification system that runs at approximately 0.8 to 1.2 seconds per frame on an NVIDIA RTX A4500. Set against LLaVA's 8.13 seconds, that is roughly seven to ten times faster (8.13 / 1.2 ≈ 6.8 and 8.13 / 0.8 ≈ 10.2) on comparable professional hardware, achieved entirely through architectural decisions rather than hardware upgrades. Here are the four patterns that made it possible.
