A few months ago, my OpenAI API bill suddenly jumped from a modest $30 to over $150 in one month. I wasn't even doing anything crazy — just running a small Slack bot that answered questions about our internal docs. But between repeated prompts, failed retries, and my own debugging queries, the tokens added up fast.
I tried the obvious fixes first: adding client-side caching, switching from gpt-4 to gpt-3.5-turbo, and even imposing manual rate limits on myself. None of it stuck. Exact-match caching doesn’t help when users ask the same question but rephrase it slightly, and rate limits just made the bot feel sluggish.
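To make the caching problem concrete, here is a minimal Python sketch of an exact-match prompt cache. It is an illustration, not the bot's actual code: the `ExactPromptCache` class, the `cache_key` helper, the `fake_llm` stand-in, and the sample questions are all hypothetical. It assumes the cache key is a hash of the model name plus the raw prompt text, which is the simplest way such a cache tends to be built.

```python
import hashlib


def cache_key(model: str, prompt: str) -> str:
    """Exact-match key: SHA-256 of the model name plus the raw prompt text."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


class ExactPromptCache:
    """Hypothetical in-memory cache that only hits on byte-identical prompts."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model, prompt, call_llm):
        key = cache_key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: this is where a real (billable) API call would happen.
        self.misses += 1
        answer = call_llm(prompt)
        self._store[key] = answer
        return answer


def fake_llm(prompt):
    """Stand-in for the provider call so the sketch runs offline."""
    return f"answer to: {prompt}"


cache = ExactPromptCache()
questions = [
    "How do I reset my VPN password?",
    "How do I reset my VPN password?",   # identical text: cache hit
    "how can i reset the VPN password",  # same intent, new wording: miss
]
for q in questions:
    cache.get_or_call("gpt-3.5-turbo", q, fake_llm)

print(f"hits={cache.hits} misses={cache.misses}")
```

Only the byte-identical repeat is served from the cache. The reworded question hashes to a different key and would trigger another paid request, even though the answer would be the same.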
So I built a lightweight AI proxy — a thin middleware layer between my app and the LLM provider. It wasn't flashy, but it immediately stopped the bleeding. Here’s the honest story of what I did, what I broke along the way, and what I’d do differently next time.
What I tried (and what didn’t work)
Client-side caching






