How llm-d Prefix-Cache Routing Made Qwen 7B on EKS 2.3x Faster

Introduction

I wanted to benchmark how much the routing layer matters for LLM inference when the workload has repeated long prefixes.

The setup was intentionally simple: Qwen2.5-7B-Instruct, vLLM, AWS EKS, FSx for Lustre, and eight g5.xlarge GPU nodes. Each node had one NVIDIA A10G GPU and ran one vLLM decode replica. The interesting part was the comparison in front of those same eight pods.

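One path used a plain Kubernetes ClusterIP Service, which spreads connections across the pods with no regard for what each pod already holds in its KV cache (effectively round-robin-style distribution). The other path used llm-d with its precise prefix-cache-aware endpoint picker, which steers each request toward a pod that has already cached the request's prefix.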
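To make the difference between the two paths concrete, here is a toy sketch in Python. It is not llm-d's implementation: the function names, the trace of 9 repeating prefixes, and the unbounded per-pod cache are all assumptions chosen for illustration, and it ignores eviction, load, and token-level block matching. It only counts how often a request lands on a pod that has already seen the same prefix.

```python
from itertools import cycle


def route_round_robin(requests, pods):
    """Send each request to the next pod in turn, ignoring cache state."""
    seen = {pod: set() for pod in pods}  # prefixes each pod has served so far
    hits = 0
    next_pod = cycle(pods)
    for prefix in requests:
        pod = next(next_pod)
        if prefix in seen[pod]:
            hits += 1
        seen[pod].add(prefix)
    return hits


def route_prefix_aware(requests, pods):
    """Prefer a pod that already served this prefix; otherwise use the least-loaded pod."""
    seen = {pod: set() for pod in pods}
    load = {pod: 0 for pod in pods}  # requests routed to each pod so far
    hits = 0
    for prefix in requests:
        warm = [p for p in pods if prefix in seen[p]]
        if warm:
            pod = min(warm, key=load.__getitem__)
            hits += 1
        else:
            pod = min(pods, key=load.__getitem__)
        seen[pod].add(prefix)
        load[pod] += 1
    return hits


# Hypothetical trace: 8 pods, 9 distinct long prefixes repeating in order.
pods = [f"pod-{i}" for i in range(8)]
requests = [f"prefix-{i % 9}" for i in range(144)]

rr_hits = route_round_robin(requests, pods)
aware_hits = route_prefix_aware(requests, pods)
print(f"round-robin hits:  {rr_hits}/{len(requests)}")
print(f"prefix-aware hits: {aware_hits}/{len(requests)}")
```

On this toy trace the cache-blind router reuses a warm prefix on 72 of 144 requests, while the prefix-aware router does so on 135, missing only the first time each prefix appears. Real KV caches are bounded, so a cache-blind router in practice fares worse than this sketch suggests, because every pod ends up trying to hold every prefix.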

The difference was large. With the same hardware and the same vLLM pods, llm-d finished the 512-concurrency benchmark in 358.7 seconds instead of 840.2 seconds, about 2.3x faster. Output throughput rose from 2,742 tok/s to 6,423 tok/s, and mean time to first token dropped from 19.0 seconds to 0.86 seconds, roughly a 22x reduction.
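The headline ratios follow directly from the figures quoted above. The short Python check below recomputes them; the variable names are mine, and the only inputs are the numbers reported in this post. As a sanity check it also multiplies throughput by wall time for each run, which should give roughly the same total output token count if both runs served the same workload.

```python
# Figures reported for the 512-concurrency benchmark.
baseline = {"wall_s": 840.2, "out_tok_s": 2742, "ttft_s": 19.0}   # ClusterIP Service
llmd = {"wall_s": 358.7, "out_tok_s": 6423, "ttft_s": 0.86}       # llm-d prefix-cache routing

wall_speedup = baseline["wall_s"] / llmd["wall_s"]
throughput_gain = llmd["out_tok_s"] / baseline["out_tok_s"]
ttft_ratio = baseline["ttft_s"] / llmd["ttft_s"]

# Implied total output tokens (throughput x wall time) for each run.
baseline_tokens = baseline["out_tok_s"] * baseline["wall_s"]
llmd_tokens = llmd["out_tok_s"] * llmd["wall_s"]

print(f"wall-clock speedup: {wall_speedup:.2f}x")
print(f"throughput gain:    {throughput_gain:.2f}x")
print(f"mean TTFT ratio:    {ttft_ratio:.1f}x")
print(f"implied output tokens: {baseline_tokens:,.0f} vs {llmd_tokens:,.0f}")
```

Both the wall-clock and throughput ratios come out at about 2.34, which is where the 2.3x in the title comes from, and the implied output token totals agree to within a small fraction of a percent (about 2.3 million tokens each).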
