If your multi-LLM workload sends the same long system prompt or file dump to Claude / Codex / Gemini ten times an hour, you are paying for the same input tokens ten times. Each provider has a cache for exactly this case, and each one exposes it differently. This post is about how llm-cli-gateway now uses those caches for you, across all five providers, so you do not have to re-implement the per-provider cache APIs yourself. I covered the previous round of changes last week and closed that piece with a teaser: Mistral Vibe was next on the list. A week later, Mistral is in, and a much larger change has landed alongside it, which is what most of this follow-up is about.
The new shape of the gateway: it now treats prompt caching as a first-class concern across all five providers, namely claude, codex, gemini, grok, and mistral (Vibe). v1.6.0 shipped today and contains the lot.
Short version: every *_request and *_request_async tool now accepts a structured promptParts shape. The gateway concatenates the parts in a canonical order, so the stable bytes come first, unchanged across calls, and the volatile tail comes last. Three new cache_state:// MCP resources expose hit-rate, hit-count, and estimated-savings aggregates back to the orchestrating agent. session_get projects a compact cacheState view at read time. A cache_ttl_expiring_soon warning fires on Claude resumes when the Anthropic cache breakpoint is within 30 seconds of expiry. All of it is opt-in (every flag defaults off in 1.x), all of it works with each provider's own cache mechanism rather than fighting it, and none of it adds conversation content to gateway storage.
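To make the canonical-order idea concrete, here is a minimal Python sketch of why ordering matters for prefix caches. It is not the gateway's code: the part names (system, context, task), the separator, and the helper functions are all hypothetical, chosen only to show that once the stable parts are always emitted first, two calls that differ only in their volatile tail share a byte-identical prefix.

```python
# Hypothetical part names and ordering; the real promptParts schema may differ.
CANONICAL_ORDER = ("system", "context", "task")
STABLE_PARTS = ("system", "context")  # assumed to repeat across calls
SEPARATOR = "\n\n"  # assumed separator, for illustration only


def assemble_prompt(parts: dict[str, str]) -> str:
    """Join parts in a fixed order, ignoring the order the caller supplied."""
    unknown = set(parts) - set(CANONICAL_ORDER)
    if unknown:
        raise ValueError(f"unknown prompt parts: {sorted(unknown)}")
    return SEPARATOR.join(parts[name] for name in CANONICAL_ORDER if name in parts)


def stable_prefix(parts: dict[str, str]) -> str:
    """The leading bytes a provider-side prefix cache could reuse."""
    return SEPARATOR.join(parts[name] for name in STABLE_PARTS if name in parts)


# Two calls with the same stable material, supplied in different key order,
# and a different volatile task at the end.
call_a = {
    "task": "Summarise module A.",
    "system": "You are a reviewer.",
    "context": "<file dump>",
}
call_b = {
    "system": "You are a reviewer.",
    "context": "<file dump>",
    "task": "Summarise module B.",
}

prompt_a = assemble_prompt(call_a)
prompt_b = assemble_prompt(call_b)

# Both prompts start with the same bytes, so a prefix cache can match them.
shared = stable_prefix(call_a)
print(prompt_a.startswith(shared) and prompt_b.startswith(shared))
```

The design point is that the caller never has to think about ordering: whatever order the parts arrive in, the volatile text is pushed to the end, which is the arrangement provider prefix caches reward.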








