Getting a large language model to think harder at inference time, a technique called test-time scaling, has become one of the more reliable ways to squeeze better answers out of AI systems. The problem is that designing those “think harder” strategies has traditionally been a manual, intuition-driven slog. Researchers tinker with heuristics, run expensive experiments, and hope they’ve found something close to optimal.
A new framework called AutoTTS, developed by researchers from Meta, Google, the University of Maryland, the University of Virginia, Washington University in St. Louis, and the University of North Carolina, takes humans largely out of that loop. The result: a 69.5% reduction in token usage compared to strong handcrafted baselines, with essentially no loss in accuracy.
How AutoTTS works, and why the numbers matter
AutoTTS replaces that manual process with an agentic loop. The system uses Anthropic’s Claude Code as an explorer agent to autonomously develop, test, and refine inference strategies. Instead of requiring repeated calls to the target LLM during the discovery phase, AutoTTS works from pre-collected reasoning trajectories and probe signals.
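To make the offline idea concrete, here is a minimal Python sketch of scoring a candidate strategy against cached samples rather than a live model. It assumes each cached trajectory records only a final answer and a token count. The names `Trajectory`, `first_k_majority`, and `evaluate_offline` are hypothetical and chosen for illustration; they are not taken from AutoTTS, and the sketch ignores probe signals entirely.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    """One pre-collected reasoning sample (hypothetical schema)."""
    answer: str  # final answer extracted from the sampled chain
    tokens: int  # tokens the chain consumed when it was generated


# A strategy looks at the cached samples for one question and returns
# (chosen answer, tokens it would have spent). No model call is made.
Strategy = Callable[[Sequence[Trajectory]], Tuple[str, int]]


def first_k_majority(k: int) -> Strategy:
    """Candidate strategy: majority vote over the first k cached samples."""
    def run(samples: Sequence[Trajectory]) -> Tuple[str, int]:
        used = samples[:k]
        answer = Counter(t.answer for t in used).most_common(1)[0][0]
        return answer, sum(t.tokens for t in used)
    return run


def evaluate_offline(
    strategy: Strategy,
    dataset: List[Tuple[Sequence[Trajectory], str]],
) -> Tuple[float, int]:
    """Replay a strategy over (cached samples, gold answer) pairs.

    Returns (accuracy, total tokens the strategy would have consumed).
    """
    correct = 0
    total_tokens = 0
    for samples, gold in dataset:
        answer, spent = strategy(samples)
        correct += int(answer == gold)
        total_tokens += spent
    return correct / len(dataset), total_tokens
```

Because every candidate is replayed against the same cache, an explorer can compare many strategies at the cost of a loop over stored data, which is the property the paragraph above describes.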
The benchmark comparison tells the story. Against SC@64, a well-known handcrafted baseline, AutoTTS achieved its 69.5% token reduction at a specific operating point (beta approximately 0.5) while matching the baseline’s mean held-out accuracy. The discovered strategies scored an average of 45.3 on held-out accuracy versus 45.2 for the baseline.
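For readers unfamiliar with the baseline, SC@64 refers to self-consistency: sample 64 reasoning chains and take the majority answer. The Python sketch below shows that vote alongside a simple early-stopping variant and the arithmetic behind a token-reduction figure. The early-stopping rule and its `lead` parameter are illustrative assumptions, not one of the strategies AutoTTS discovered, and the sketch assumes reduction is measured as one minus the ratio of tokens spent. Samples are assumed to be non-empty lists of `(answer, tokens)` pairs.

```python
from collections import Counter


def sc_at_n(samples, n):
    """Self-consistency: majority vote over the first n (answer, tokens) samples."""
    used = samples[:n]
    answer = Counter(a for a, _ in used).most_common(1)[0][0]
    return answer, sum(t for _, t in used)


def early_stop_vote(samples, n, lead):
    """Illustrative rule (an assumption, not AutoTTS's): consume samples one
    at a time and stop once the top answer leads the runner-up by `lead` votes."""
    votes = Counter()
    spent = 0
    for answer, tokens in samples[:n]:
        votes[answer] += 1
        spent += tokens
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= lead:
            break
    return votes.most_common(1)[0][0], spent


def token_reduction(baseline_tokens, strategy_tokens):
    """Fraction of baseline tokens saved, e.g. 0.695 for a 69.5% reduction."""
    return 1.0 - strategy_tokens / baseline_tokens
```

On this definition, a 69.5% reduction means the strategy spends about 30.5% of the tokens the baseline does for the same set of questions.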