A few months ago, I was building a feature that needed to generate summaries of user-uploaded documents. I went straight to OpenAI's API — it's the obvious choice, right? Within a week, I hit the rate limit wall. Hard. My app would just return 429 errors during peak usage. Users saw blank pages. I tried retries with exponential backoff, but that made things worse — queued requests backed up, and the latency became unbearable.
Then I looked at the bill. $300 for a week of moderate usage. I knew I needed a different approach.
What I tried first
My first instinct was to cache responses. If two users uploaded the same document, why call the API twice? So I added a simple in-memory cache keyed by a hash of the document. That helped a bit, but most documents were unique. Cache hit rate was maybe 5%.
Next, I tried switching to a cheaper provider. I experimented with Anthropic's Claude and Cohere. Their APIs were different, authentication was different, response formats were different. I ended up writing a messy adapter layer that still broke whenever I switched providers mid-request.







