Most semantic cache benchmarks are a vendor showing you the one dataset where they win, on a model they fine-tuned, against a competitor they configured badly. You read it, you nod, you learn nothing.

I built and maintain a semantic cache library (@betterdb/semantic-cache on npm, betterdb-semantic-cache on PyPI, MIT, Valkey-native). So I had two choices. Write that post about my own library, or run the comparison straight and publish it even where I only tie. I did the second one. Four public datasets, two peers (RedisVL and Upstash), one self-tuning loop, and a fair amount of being wrong before being right.

There was no honest cross-library comparison of semantic caches anywhere I could find. So I made one. This is the short version. Links to the full tables and methodology are at the bottom.

1. Quality is a tie. That is the result you want.

Fix the embedding model and every honest semantic cache is doing the same thing: embed the prompt, measure cosine distance against stored prompts, and return a hit when the nearest distance falls below a threshold. So peak F1 converges. There is no secret sauce in the lookup.
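
To make that lookup concrete, here is a minimal sketch in Python. It is not the API of any library named in this post. `ToySemanticCache` and `toy_embed` are hypothetical names, the embedding is a letter-count stand-in for a real model, and the linear scan stands in for a vector index. The only point is the shape: embed, take the nearest cosine distance, compare against a threshold.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat an all-zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: per-letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


class ToySemanticCache:
    """Illustrative only: in-memory store with a linear nearest-neighbor scan."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        if not self.entries:
            return None
        query = self.embed(prompt)
        # Find the stored prompt closest to the query.
        distance, response = min(
            ((cosine_distance(query, vec), resp) for vec, resp in self.entries),
            key=lambda pair: pair[0],
        )
        # A hit only when the nearest distance is below the threshold.
        return response if distance < self.threshold else None


cache = ToySemanticCache(toy_embed, threshold=0.1)
cache.store("what is the capital of france", "Paris")
hit = cache.lookup("What is the capital of France?")
miss = cache.lookup("zzzz")
```

Here `hit` is the cached response and `miss` is `None`. Swap the embedding function and the index for real ones and nothing about the decision changes, which is why the quality numbers end up so close once the embedding model is fixed.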