If you're building on top of an LLM API and the bill is starting to bite, you've probably read that caching is the answer. The follow-up question is which kind of caching, and the honest answer is: usually both, but for different reasons. Exact-match caching costs you almost nothing to run and never returns a wrong answer; the catch is that it hits maybe one in ten requests in production. Semantic caching catches several times that volume but introduces a correctness risk you have to engineer for. This post walks through where each one wins, the math behind the tradeoff, and how to decide what to run for your workload.

Caching is part of AI API caching as a discipline — exact and semantic are two of the three layers; the third is provider-native cache passthrough, covered separately.

Definitions, briefly

Exact-match caching computes a deterministic fingerprint of the request (typically SHA-256 over the normalized messages array, model name, temperature, and other request parameters), then looks up that fingerprint in a key-value store like Redis. If the fingerprint exists, return the cached response. Lookup is O(1) and sub-10ms p95. The store is bounded by your cache size budget; entries evict by LRU or TTL.