Introduction

I wanted to benchmark how much the routing layer matters for LLM inference when the workload has repeated long prefixes.

The setup was intentionally simple: Qwen2.5-7B-Instruct, vLLM, AWS EKS, FSx for Lustre, and eight g5.xlarge GPU nodes. Each node had one NVIDIA A10G GPU and ran one vLLM decode replica. The interesting part was what sat in front of those same eight pods.

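One path used a plain Kubernetes ClusterIP Service, which spreads connections across the pods without looking at request content or cache state. In kube-proxy's default iptables mode the backend is picked at random per connection, so in practice the distribution is round-robin-like. The other path used llm-d with the precise prefix-cache-aware endpoint picker, which tries to send a request to a pod that already holds the KV cache for its prefix.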
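To make the difference between the two paths concrete, here is a toy sketch of cache-blind versus prefix-aware routing. It is not llm-d's scorer and it does not model vLLM's KV cache: the function names, the unbounded per-pod "cache" (a set of prefix labels), and the synthetic workload are all assumptions made for illustration. The workload is also deliberately adversarial for round-robin (9 prefixes cycling over 8 pods), so treat the numbers as the extreme case, not a prediction.

```python
# Toy model: each pod remembers which prefixes it has served (hypothetical,
# unbounded cache). A "hit" means the request landed on a pod that has
# already served the same prefix.

def simulate(requests, num_pods, pick):
    """Return the fraction of requests that hit a pod with a warm prefix."""
    seen = [set() for _ in range(num_pods)]
    hits = 0
    for i, prefix in enumerate(requests):
        pod = pick(i, prefix, seen)
        if prefix in seen[pod]:
            hits += 1
        seen[pod].add(prefix)
    return hits / len(requests)


def round_robin(i, prefix, seen):
    """Cache-blind routing: ignore the prefix and cycle through the pods."""
    return i % len(seen)


def prefix_aware(i, prefix, seen):
    """Prefer a pod that already holds the prefix; otherwise pick the pod
    that currently holds the fewest prefixes (a stand-in for load)."""
    for pod, cache in enumerate(seen):
        if prefix in cache:
            return pod
    return min(range(len(seen)), key=lambda p: len(seen[p]))


# Synthetic workload: 9 distinct long prefixes, each repeated 8 times.
requests = [f"prefix-{i % 9}" for i in range(72)]

rr_hit_rate = simulate(requests, num_pods=8, pick=round_robin)
aware_hit_rate = simulate(requests, num_pods=8, pick=prefix_aware)
print(f"round-robin hit rate:  {rr_hit_rate:.3f}")
print(f"prefix-aware hit rate: {aware_hit_rate:.3f}")
```

In this contrived workload the cache-blind picker never returns to a pod that has seen the prefix, while the prefix-aware picker misses only on the first occurrence of each prefix. Real traffic sits somewhere in between, and a real picker also has to weigh load and cache eviction, which this sketch ignores.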

The difference was large. With the same hardware and the same vLLM pods, llm-d finished the 512-concurrency benchmark in 358.7 seconds instead of 840.2 seconds, about 2.3x faster. Output throughput went from 2,742 tok/s to 6,423 tok/s, also about 2.3x, and mean time to first token dropped from 19.0 seconds to 0.86 seconds, roughly a 22x reduction.
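For reference, the ratios behind those headline numbers can be checked with a few lines of arithmetic. The figures are the ones quoted above; the dictionary keys and the helper name are just labels chosen for this sketch.

```python
# Headline numbers from the 512-concurrency run, as quoted in the text.
baseline = {"duration_s": 840.2, "output_tok_s": 2742.0, "mean_ttft_s": 19.0}
llm_d = {"duration_s": 358.7, "output_tok_s": 6423.0, "mean_ttft_s": 0.86}


def improvement(metric, higher_is_better=False):
    """Return how many times better llm-d is than the baseline on a metric."""
    if higher_is_better:
        return llm_d[metric] / baseline[metric]
    return baseline[metric] / llm_d[metric]


duration_ratio = improvement("duration_s")
throughput_ratio = improvement("output_tok_s", higher_is_better=True)
ttft_ratio = improvement("mean_ttft_s")

print(f"wall-clock speedup: {duration_ratio:.2f}x")
print(f"throughput gain:    {throughput_ratio:.2f}x")
print(f"mean TTFT ratio:    {ttft_ratio:.1f}x")
```

The wall-clock and throughput ratios agree closely, which is what you would expect if both runs generated about the same number of output tokens. The much larger TTFT ratio is a separate effect on top of that.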