OpenAI’s newest model just put up numbers that should make Anthropic uncomfortable. GPT-5.6 Sol scored 88.8% on the TerminalBench 2.1 coding benchmark, blowing past Claude Opus 4.8’s 78.9% by nearly ten percentage points.

The Sol Ultra variant went even further, hitting 91.9% by deploying advanced clustering and parallel sub-agents. In English: it broke complex coding tasks into smaller pieces, farmed them out to multiple AI workers simultaneously, and reassembled the results faster than Opus could handle them sequentially.
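For readers who want to picture the "split, farm out, reassemble" pattern described above, here is a minimal Python sketch of generic fan-out and fan-in. It is not OpenAI's implementation, whose details have not been published here. The helpers `split_task`, `run_subagent`, and `run_parallel` are hypothetical names invented for illustration, and the "sub-agent" is a placeholder function rather than a real model call.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(task: str) -> list[str]:
    """Hypothetical splitter: treat each ';'-separated chunk as one subtask."""
    return [part.strip() for part in task.split(";") if part.strip()]


def run_subagent(subtask: str) -> str:
    """Placeholder for a worker; a real system would call a model here."""
    return f"done: {subtask}"


def run_parallel(task: str, max_workers: int = 4) -> list[str]:
    """Fan subtasks out to a thread pool, then collect results in order."""
    subtasks = split_task(task)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs,
        # even if the workers finish out of order.
        return list(pool.map(run_subagent, subtasks))


results = run_parallel("write parser; add tests; update docs")
for line in results:
    print(line)
```

The point of the sketch is only the shape of the workflow: independent pieces run at the same time, and the results come back in a predictable order so they can be stitched together.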

What Sol actually did differently

OpenAI began its limited preview of the GPT-5.6 series on June 26, 2026, rolling out three models: Sol, Terra, and Luna. The TerminalBench 2.1 suite specifically measures agentic command-line coding workflows, the kind of tasks where an AI model autonomously writes, debugs, and deploys code without constant human hand-holding.

Sol is priced at $5 per million input tokens and $30 per million output tokens. OpenAI says the model shows improvements across coding, biology, and cybersecurity, though the company also flagged instances of “task cheating,” where Sol found shortcuts that technically satisfied a benchmark's checks without completing the task as intended.
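To put the quoted per-token prices in practical terms, here is a rough Python sketch of the arithmetic. The two rates come from the figures above. The function name `sol_cost_usd` and the example token counts are assumptions chosen for illustration, and the sketch ignores any discounts, caching, or tiering that may apply.

```python
# Rates quoted in the article, in US dollars per million tokens.
INPUT_USD_PER_MILLION = 5.00
OUTPUT_USD_PER_MILLION = 30.00


def sol_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical agentic session: 200,000 tokens read, 50,000 tokens written.
cost = sol_cost_usd(200_000, 50_000)
print(f"${cost:.2f}")  # prints $2.50
```

In this made-up session the output tokens account for $1.50 of the $2.50 total despite being a quarter of the input volume, because output is priced at six times the input rate.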