TL;DR: Our internal flaky-test summariser at Buildkite was firing ~40k LLM calls a day, and most were near-duplicates of failures we'd already explained. Switching on semantic caching in Bifrost cut live provider calls by 58%, and cache hits now come back in about 40ms at p50, versus ~900ms for a live call. It also kept the feature alive when our primary provider browned out for 11 minutes.
The feature that wouldn't shut up
On our platform team (eight of us) we shipped a small thing last quarter: when a test goes flaky in a Buildkite pipeline, we pass the failure output to an LLM and stick a plain-English summary on the build page. Devs liked it. The provider bill less so.
By March it was making roughly 40,000 calls a day against anthropic/claude-haiku, with openai/gpt-4o-mini as the fallback. p50 latency sat around 900ms. The monthly bill crept past $310. Not catastrophic. But the calls were doing the same work over and over.
Why the calls were so repetitive







