The promised hit rate of an exact-match LLM cache is 5-15% on real production traffic. Most teams that deploy one see hit rates near zero for the first few weeks and assume caching doesn't work for their workload. It almost always works; the cache is just being defeated by trivial request variations that fingerprint differently even though they should hit the same key. This post is the discipline that closes that gap — the seven normalisation pitfalls that break naive cache implementations, with the fix patterns that hold up under production traffic.
The parent guide on AI API caching covers the cache layers and economics; this article goes one level deeper into the fingerprinting discipline that makes Layer 1 (exact-match) actually work.
What fingerprinting is supposed to do
An exact-match cache stores responses keyed by a deterministic identifier — almost always a SHA-256 hash over a canonicalised representation of the request. When a new request arrives, you compute the same hash; if the key exists, return the cached response. The cache is provably correct because the fingerprint guarantees byte-equivalence at the input.
The fingerprint is supposed to capture everything that affects the response and exclude everything that doesn't. The two boundaries are where most teams get into trouble. Including too little misses real cache hits; including too much misses cache hits that should land. Including the wrong things (timestamps, request IDs, user metadata) splits the cache into shards of one entry each.






