Prefix caching in vLLM under multi-tenant agent traffic

TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop from 480ms to 110ms on one tenant and stay exactly the same on another. The split wasn't about traffic volume. It was about how each team templated their system prompts.

The setup

Our fine-tuning team serves 14 enterprise agents through a shared inference cluster. Four H100 nodes, vLLM 0.6.x, Qwen2.5-32B as the workhorse model. Traffic is bursty. One customer's nightly workflow can hit 8k requests in twenty minutes while another trickles through 30 calls an hour.

Before turning on prefix caching, TTFT across the cluster sat at 410ms p50 and 1.2s p95. Cost wasn't the urgent problem. Latency was, because agents loop: each step waits on the one before it, so TTFT is paid once per step. A 400ms TTFT on a 12-step plan turns into 4.8 seconds of dead time before the user sees anything.
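To make the compounding concrete, here is a minimal back-of-envelope sketch. It assumes the plan's steps run strictly one after another and that every step pays the same TTFT, which real traffic won't do exactly. The function name dead_time_ms is ours, for illustration only. The 110ms figure is the cached TTFT from the one tenant that benefited.

```python
def dead_time_ms(ttft_ms: float, steps: int) -> float:
    """Time spent waiting on first tokens across a sequential agent plan.

    Assumption: steps do not overlap and each one pays the same TTFT.
    """
    return ttft_ms * steps


# Numbers from this post: roughly 400ms TTFT before caching, 110ms after
# (for the tenant that benefited), on a 12-step plan.
before = dead_time_ms(400, 12)
after = dead_time_ms(110, 12)

print(f"before: {before / 1000:.2f}s, after: {after / 1000:.2f}s")
```

Under those assumptions the same 12-step plan drops from 4.8 seconds of waiting to about 1.3. That is why per-request TTFT, not throughput, was the number we cared about.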

What the cache actually does
