A few months ago, I was building a chatbot for a client that needed to handle customer support queries. The requirements were straightforward: answer common questions, escalate complex issues, and keep latency under 2 seconds. I started with OpenAI’s API because it was the easiest way to get going, but after a week of testing, the bill was already climbing into triple digits. That’s when I realized I couldn’t just throw more money at the problem. I needed a smarter architecture.

The Problem: Every query costs money

I had a list of about 200 common support questions that covered 80% of what users asked. But my naive implementation sent every single user message to GPT-4. Even with prompt caching and trimmed prompts, each turn of a conversation cost $0.03–$0.10. Multiply that by hundreds of users, and it became unsustainable fast.
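To make that multiplication concrete, here is a rough back-of-the-envelope sketch. The per-turn cost range and the 80% share come from the numbers above; the user count and turns per day are hypothetical placeholders, not measurements from the project, so swap in your own traffic figures.

```python
def monthly_cost(users, turns_per_user_per_day, cost_per_turn, days=30):
    """Estimate monthly API spend when every turn goes to the model."""
    return users * turns_per_user_per_day * cost_per_turn * days


# Hypothetical traffic assumptions (not from the original project).
USERS = 300
TURNS_PER_USER_PER_DAY = 5

# Per-turn cost range quoted in the text.
COST_LOW, COST_HIGH = 0.03, 0.10

# Share of traffic covered by the ~200 common questions.
COMMON_SHARE = 0.80

low = monthly_cost(USERS, TURNS_PER_USER_PER_DAY, COST_LOW)
high = monthly_cost(USERS, TURNS_PER_USER_PER_DAY, COST_HIGH)

print(f"Estimated monthly spend: ${low:,.0f} to ${high:,.0f}")
print(
    "Of which spent on common questions: "
    f"${low * COMMON_SHARE:,.0f} to ${high * COMMON_SHARE:,.0f}"
)
```

Whatever traffic numbers you plug in, the second line is the interesting one: roughly four fifths of the spend goes to questions whose answers rarely change.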

Latency was an even bigger issue. For simple questions like “What are your business hours?” a full round-trip to the API took 1–3 seconds, so the slower responses blew past the 2-second target. Users expected instant answers, not a spinning loader.

What I tried that didn’t work