Full-Stack Optimizations for Agentic Inference with NVIDIA Dynamo | NVIDIA Technical Blog

Coding agents are starting to write production code at scale. Stripe’s agents generate 1,300+ PRs per week. Ramp attributes 30% of merged PRs to agents. Spotify reports 650+ agent-generated PRs per month. Tools like Claude Code and Codex make hundreds of API calls per coding session, each carrying the full conversation history. Behind every one of these workflows is an inference stack under significant KV cache pressure.

Figure 1. Cumulative KV cache reads outpace writes in agentic inference due to repeated reuse of prompt and context across sequential requests.

Let's take Claude Code as an example. After the first API call writes the conversation prefix to KV cache, every subsequent call to the same worker sees an 85-97% cache hit rate. Agent teams (or swarms) push this further, with a 97.2% aggregate cache hit rate across 4 Opus teammates. An 11.7x read/write ratio means the system reads nearly 12 tokens from cache for every token it writes. This is a write-once-read-many (WORM) access pattern: the system prompt and growing conversation prefix are computed once, then served from cache on every subsequent call. Maximizing the cache reuse rate across all workers, and keeping KV blocks warm and routable, is the central optimization target for agentic inference.
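To make the WORM pattern concrete, here is a minimal sketch of how reads and writes accumulate over one session. It is a toy model, not Dynamo's accounting: the names `CallUsage`, `simulate_session`, and `summarize` are hypothetical. It assumes every call lands on the same worker, nothing is evicted, and each turn appends a fixed number of new tokens to the prefix.

```python
from dataclasses import dataclass


@dataclass
class CallUsage:
    cache_read_tokens: int   # prompt tokens served from existing KV cache
    cache_write_tokens: int  # prompt tokens computed and newly written to KV cache


def simulate_session(system_prompt_tokens: int,
                     tokens_added_per_turn: int,
                     turns: int) -> list[CallUsage]:
    """Toy model: each call resends the full history, so only the new suffix misses.

    Assumes perfect prefix reuse (same worker, no eviction).
    """
    calls = []
    cached_prefix = 0                      # tokens already resident in KV cache
    prompt_length = system_prompt_tokens   # the first call carries only the system prompt
    for _ in range(turns):
        read = cached_prefix                   # reused prefix
        write = prompt_length - cached_prefix  # new suffix, written once
        calls.append(CallUsage(read, write))
        cached_prefix = prompt_length
        prompt_length += tokens_added_per_turn
    return calls


def summarize(calls: list[CallUsage]) -> tuple[float, float]:
    """Return (cache hit rate, read/write ratio) over the whole session."""
    reads = sum(c.cache_read_tokens for c in calls)
    writes = sum(c.cache_write_tokens for c in calls)
    hit_rate = reads / (reads + writes)
    read_write_ratio = reads / writes
    return hit_rate, read_write_ratio


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the article.
    session = simulate_session(system_prompt_tokens=10_000,
                               tokens_added_per_turn=1_000,
                               turns=50)
    hit_rate, ratio = summarize(session)
    print(f"hit rate: {hit_rate:.1%}, read/write ratio: {ratio:.1f}x")
```

In this model the tokens written grow linearly with the number of turns, while the tokens read grow roughly quadratically, because every call rereads the entire prefix. That is why longer sessions push the hit rate and read/write ratio up, and why losing a warm prefix (through eviction or routing to a cold worker) is expensive.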


