Recent industry events point to a massive return-on-investment dilemma as AI token spend spirals out of control. Uber reportedly blew through its entire year’s AI budget in months, forcing a drastic re-evaluation of its agentic workflows. Meta has placed hard caps on its internal AI compute spend, while Amazon has shut down its internal AI leaderboard that encouraged engineers to burn cash on LLM tokens. If tech giants with virtually limitless pockets are sweating the compute bill and hitting the brakes, the broader tech ecosystem faces an even steeper uphill climb.

Much of this compute budget is burning on Chain-of-Thought (CoT) prompting and training. CoT is the method where large language models (LLMs) are instructed, or fine-tuned, to “think step-by-step” before delivering a final answer. In general, LLMs perform better on reasoning tasks when forced to generate a sprawling sequence of intermediate “thinking” tokens.

CoT originally became the undisputed industry standard because it was a brilliant, pragmatic hack. It leveraged the existing text-generation interface of autoregressive transformers without requiring structural changes to the underlying architecture. It scaled predictably with added inference-time compute and gave human operators an easily readable, text-based trace of what the model was ostensibly doing.