Dan Steele, CEO of Listening Post.

Microsoft gave thousands of engineers access to Claude Code in December 2025. By May, they were canceling most of those licenses and moving engineers over to Copilot. The official reason was toolchain unification. The real reason might have been the bill. The team had burned through its annual AI budget in a matter of months.

Uber reportedly exhausted its planned 2026 AI tooling budget by April after deploying Claude Code across roughly 5,000 engineers. Company leaders later raised questions about whether rapidly rising AI spending was translating into proportional product impact.

Amazon went a different direction and ended up with a different problem. Reports indicate the company tracked AI usage through internal leaderboards and set adoption targets covering more than 80% of developers. Some employees reportedly began using AI agents for low-value or unnecessary tasks to boost their rankings, a practice insiders dubbed "tokenmaxxing." Amazon later moved to shut down the leaderboard and urged employees not to use AI simply for the sake of increasing usage metrics.

The API default is where the money goes.

When a developer connects an application to an AI API, they point it at the best model available. That is the model the team tested against. That is the model that held up in demos. Changing it requires a decision, and the default requires nothing. So every call goes to the same place, whether it is asking the model to restructure a JSON payload or reason through a security architecture.

The output on the JSON restructuring is not better from a frontier model than it would be from something a tenth of the cost.

At low volumes, that difference disappears into the noise. At large scale, it is the entire problem.

Token-based pricing does not work the way seat licensing does. A seat license costs the same whether the employee uses the software all day or barely touches it.
A token-priced API charges for every unit of compute the model runs. When every request goes to the most capable model on the market, you are paying for maximum compute on requests that do not need it. That is the math that can end budget years within months.

The capability gap between model tiers is not what it was.

Models that were mid-tier 18 months ago are now handling tasks that required frontier models at the time. The gap at the top keeps widening on hard benchmarks, but for the work that makes up most of what gets sent to an API (classification, summarization, extraction, formatting, lookup and simple generation), a smaller and cheaper model produces output that is functionally identical. In fact, a large percentage of tasks could be handled by an open-source model hosted locally.

The frontier model earns its cost when the task actually needs it, such as complex code generation, multistep reasoning, synthesis across long documents and anything where a wrong answer compounds into a bigger problem downstream. That work is real. It is also not most of what gets sent.

Building routing logic into API architecture means deciding before the request goes out which category it belongs to. That decision does not require a frontier model to make. A lightweight classifier or a rules layer handles it with minimal overhead. The savings at scale are not minimal.

Every call to a single provider is also a data decision.

When all API traffic routes through one commercial provider, that provider holds a complete record of everything the organization asked. The sensitive strategic questions that genuinely needed a frontier model are sitting in the same pipeline as the routine document formatting and the internal summaries. Segmenting that data at the architecture level is straightforward.
Most organizations have not done it because they never had to think about it.

Most teams cannot cleanly answer which queries are leaving their environment, through which provider, under what retention terms. That is a problem that grows with volume. Routing by sensitivity alongside routing by complexity handles both at once. Work that should not leave the building goes to local or self-hosted models. Routine low-sensitivity work goes to cheap, fast, commercial models. The expensive frontier API handles what it is actually built for.

Nobody in the current setup has a reason to fix this for you.

Model providers make more money when more tokens flow through the most expensive endpoints. Platforms built on top of them default to the flagship model because it performs best in evaluations and generates the fewest complaints. The engineer making the API call and the person responsible for the infrastructure budget are usually not the same person. The cost accumulates at a distance from the decision that created it.

The Amazon situation shows what happens when organizations try to solve adoption by measuring the wrong thing. Token consumption as a performance metric produces token consumption. It does not produce value. Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure.") has been around long enough that it should not keep catching enterprise AI programs off guard, but here we are.

The organizations building routing into their AI infrastructure now could have lower costs, cleaner data practices and more flexibility as usage grows. The ones treating every call as equivalent will eventually be forced to sort it out under worse conditions, with more technical debt and a larger bill already on the table.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
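
For readers who want to see what a rules layer of this kind might look like, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular company's system: the tier names, the `Request` type and the `route` function are all hypothetical, and the task categories are simply the examples this article lists. The sketch checks sensitivity first, then complexity, and sends anything it does not recognize to the frontier tier as a conservative default.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical tier labels. A real deployment would map these to actual endpoints.
LOCAL = "local-self-hosted"
CHEAP = "small-commercial"
FRONTIER = "frontier-commercial"

# Task categories borrowed from the article's examples (names are illustrative).
ROUTINE_TASKS = {
    "classification", "summarization", "extraction",
    "formatting", "lookup", "simple_generation",
}


@dataclass(frozen=True)
class Request:
    task: str        # category assigned upstream by a classifier or by the caller
    sensitive: bool  # True if the content should not leave the organization


def route(request: Request) -> str:
    """Pick a model tier before the request goes out."""
    # Sensitivity comes first: this work stays on local or self-hosted models.
    if request.sensitive:
        return LOCAL
    # Routine, low-sensitivity work goes to a cheap, fast commercial model.
    if request.task in ROUTINE_TASKS:
        return CHEAP
    # Complex or unrecognized work falls through to the frontier tier
    # (an assumed default; a stricter policy could reject unknown tasks instead).
    return FRONTIER


if __name__ == "__main__":
    batch = [
        Request("formatting", sensitive=False),
        Request("summarization", sensitive=True),
        Request("multistep_reasoning", sensitive=False),
        Request("extraction", sensitive=False),
    ]
    # Tally how many requests land on each tier.
    print(Counter(route(r) for r in batch))
```

The ordering is the design choice worth noting: because the sensitivity check runs before the complexity check, a sensitive request never reaches a commercial endpoint, even when the task would otherwise justify the frontier model.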
How Tokenmaxxing Became Enterprise AI's Biggest Unforced Error










