I expected the cheaper model to be cheaper. It cost 8.6 more.

I'd routed the same one-word prompt to Claude Haiku and to Gemini 2.5 Flash. Flash has the lower per-token price, so this should have been an easy win. It wasn't. Flash is a thinking model: before it answered "Paris," it spent a few dozen tokens reasoning, and reasoning is billed as output. Haiku answered in 4 tokens. Flash spent about 28 — lower unit price, far more units, ~8.6× the bill per request. I only caught it because I'd instrumented every call: tokens, cost, latency, written to Postgres. And I instrument every call because that's what you do when you've spent years keeping payment systems honest.

That instinct is the whole point. For two and a half years I built cross-border real-time payments at NPCI — the kind of system where a rounding error is an incident and a downstream going dark is a Saturday night. This year I built an LLM gateway, and I kept reaching for the same tools. That's not a coincidence.

AI infrastructure looks like a new field. Underneath, it's the systems work backend engineers have always done. A model API is a downstream dependency: it's slow, it's occasionally down, it rate-limits you, and it bills you per call. You have integrated a dependency exactly like that before — a payment processor, a KYC provider, a partner bank. The model isn't magic. It's a new kind of expensive, flaky downstream, and the hard parts — reliability, cost control, failover — are the parts we already solved.

I expected the cheaper model to be cheaper. It cost 8.6 more.

Other newsrooms on this story

Related reading

Gemini 3.6 Flash: the Thinking Dial That Moves Cost 30x (Measured)

Gemini 3.5 Flash vs Claude Haiku 4.5 vs MAI-Code-1-Flash for Coding

How I Made the Cheapest Model Match the Best — at 1/640th the Cost

Chinese AI Models Are 40x Cheaper Than GPT-4o — Here's the Proof

Google's Gemini 3.5 Flash follows Anthropic and OpenAI in making newer AI…

Why Your Gemini Bill Doesn't Match the Model Names