Our gateway handles a few thousand LLM calls per hour. Mostly internal tools, some customer-facing agents. We noticed something in the logs: a lot of prompts were basically the same question worded differently.
"Summarize this quarterly report" and "give me a summary of the Q2 report" hitting the same model, getting nearly identical responses, costing us twice. Multiply that across a few hundred users and it adds up fast.
The math on duplicate calls
Quick back-of-envelope. GPT-4o runs \$2.50 per million input tokens, \$10 per million output. Claude Sonnet is \$3/\$15. A typical summarization request with context is maybe 2K input tokens and 500 output. That's roughly \$0.01 per call on GPT-4o (\$0.005 for input plus \$0.005 for output).
Doesn't sound like much until you're doing 50K calls a day and 30-40% of them are semantically identical. That's 15K-20K duplicate calls, or \$150-200/day in duplicate spend. Roughly \$4.5K-6K/month. For responses you already generated.
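To make the arithmetic easy to rerun with your own numbers, here is a minimal sketch in Python. The prices, token counts, call volume, and duplicate rates are the ones quoted above; the `PRICES` table and the function names are hypothetical, chosen only for illustration, and the model keys are plain labels rather than real API model identifiers.

```python
# Back-of-envelope estimate of daily spend on semantically duplicate calls.
# Prices are USD per million tokens (input, output), as quoted in the text.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
}


def cost_per_call(model, input_tokens=2_000, output_tokens=500):
    """Cost in USD of a single request with the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def duplicate_spend_per_day(model, calls_per_day=50_000, duplicate_rate=0.30):
    """USD per day spent on calls that repeat an earlier request."""
    return calls_per_day * duplicate_rate * cost_per_call(model)


if __name__ == "__main__":
    for model in PRICES:
        low = duplicate_spend_per_day(model, duplicate_rate=0.30)
        high = duplicate_spend_per_day(model, duplicate_rate=0.40)
        print(
            f"{model}: ${cost_per_call(model):.4f} per call, "
            f"${low:,.2f} to ${high:,.2f} per day on duplicates"
        )
```

Swap in your own traffic and duplicate rate to see where you land. The estimate assumes every duplicate costs as much as the typical request, which real traffic won't match exactly.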






