If you ship anything on top of an LLM API, you've probably had this moment: you check the dashboard at the end of the month and the bill is 3x what you modeled. Nothing broke. Usage just... drifted. A few prompts got chattier, one model started "thinking" more, and your per-token math quietly fell apart.

I've been living in that loop, so I want to lay out why per-token pricing is hard to forecast, and a different billing shape that trades some theoretical savings for a number you can actually predict.

Why per-token spend is so hard to model

Per-token billing looks simple (price per token × tokens used), but three things make it slippery in practice:

1. Output length is not yours to control. The max_tokens parameter sets a hard ceiling, but the model decides how much of that ceiling it actually uses. Two models given the identical prompt can produce wildly different output lengths, and the verbose one costs you more for the same task.