Your token costs are growing faster than your usage. You've already optimized model selection on non-critical paths. Now you need real wins on your main feature without tanking quality.

Most token optimization advice is too generic. "Use shorter prompts" and "cache your context" are true but useless: they don't tell you where the bloat actually is, what the tradeoffs look like, or when to stop optimizing because you're just hurting yourself.

This guide covers the full stack: code-level techniques (structured output, trimming, compression, caching, batching), infrastructure wins, and the cost governance layer that makes the savings stick in production. Each one comes with real numbers. By the end you'll know what works, what doesn't, and when you're optimizing the wrong thing.

Part 1: Code-Level Wins

Technique 1: Structured output