Cache-Aware Spawning: What Changed in llm-cli-gateway, a Week On

If your multi-LLM workload sends the same long system prompt or file dump to Claude / Codex / Gemini ten times an hour, you are paying for the same input tokens ten times. Each provider has a cache for exactly this case, and each one expresses the cache differently. This post is about how llm-cli-gateway now uses those caches for you, across all five providers, so you do not have to re-implement the per-provider cache APIs yourself. I covered the previous round of changes last week and closed that piece with a teaser: Mistral Vibe was next on the list. A week later, Mistral is in, and a much larger change has landed alongside it, which is what most of this follow-up is about.

The new shape of the gateway: it now treats prompt caching as a first-class concern across all five providers, namely claude, codex, gemini, grok, and mistral (Vibe). v1.6.0 shipped today and contains the lot.

Short version: every *_request and *_request_async tool now accepts a structured promptParts shape. The gateway concatenates the parts in a canonical order, so the stable bytes come first, unchanged across calls, and the volatile tail comes last. Three new cache_state:// MCP resources expose hit-rate, hit-count, and estimated-savings aggregates back to the orchestrating agent. session_get projects a compact cacheState view at read time. A cache_ttl_expiring_soon warning fires on Claude resumes when the Anthropic cache breakpoint is within 30 seconds of expiry. All of it is opt-in (every flag defaults off in 1.x), all of it observes the per-provider cache mechanism rather than fighting it, and none of it adds conversation content to gateway storage.
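To make the canonical-ordering idea concrete, here is a minimal Python sketch, not the gateway's actual code. The part names (system, context, task), the separator, and the helper names are all hypothetical; the real promptParts fields may differ. It shows why ordering matters: two calls that pass the same stable parts in a different key order, with a different task, still produce a byte-identical prefix. The last helper illustrates the 30-second expiry check, assuming a 300-second TTL (Anthropic's default cache lifetime) and timestamps supplied by the caller.

```python
import hashlib

# Hypothetical part names, ordered from most stable to most volatile.
# The gateway's real promptParts field names may differ.
CANONICAL_ORDER = ("system", "context", "task")
VOLATILE_PARTS = {"task"}
SEPARATOR = "\n\n"  # assumed separator, chosen for illustration


def assemble_prompt(parts: dict[str, str]) -> tuple[str, str]:
    """Return (stable_prefix, full_prompt) with parts in canonical order."""
    unknown = set(parts) - set(CANONICAL_ORDER)
    if unknown:
        raise ValueError(f"unknown prompt parts: {sorted(unknown)}")
    present = [name for name in CANONICAL_ORDER if name in parts]
    stable = [parts[n] for n in present if n not in VOLATILE_PARTS]
    volatile = [parts[n] for n in present if n in VOLATILE_PARTS]
    stable_prefix = SEPARATOR.join(stable)
    full_prompt = SEPARATOR.join(stable + volatile)
    return stable_prefix, full_prompt


def prefix_fingerprint(stable_prefix: str) -> str:
    """Hash the stable bytes so two calls can be compared cheaply."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()


def ttl_expiring_soon(
    last_cache_write: float,
    now: float,
    ttl_seconds: float = 300.0,  # assumption: 5-minute cache lifetime
    warn_window: float = 30.0,   # the 30-second window from the text
) -> bool:
    """True when the cache entry is still live but within the warn window."""
    remaining = ttl_seconds - (now - last_cache_write)
    return 0 <= remaining <= warn_window


if __name__ == "__main__":
    # Same stable parts, different key order, different volatile task.
    a_prefix, a_full = assemble_prompt(
        {"task": "Summarize module A", "system": "You are a reviewer.", "context": "FILE DUMP"}
    )
    b_prefix, b_full = assemble_prompt(
        {"context": "FILE DUMP", "system": "You are a reviewer.", "task": "Summarize module B"}
    )
    print(prefix_fingerprint(a_prefix) == prefix_fingerprint(b_prefix))
    print(ttl_expiring_soon(last_cache_write=0.0, now=275.0))
```

Because provider caches match on the leading bytes of the request, anything that perturbs the prefix (reordered parts, a timestamp in the system prompt) turns a would-be hit into a full-price miss. Keeping the volatile part strictly last is the whole trick.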
