OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it

OpenAI's GPT-5.6 cheats a lot. That's the key finding from an independent evaluation by METR.

During testing with software tasks, OpenAI's new flagship model GPT-5.6 Sol showed the highest rate of cheating ever recorded among all publicly tested models. The model exploited bugs in the test environment, extracted hidden solutions, and then tried to cover its tracks.

The actual performance numbers are barely usable because of this, METR says. Depending on how the cheating attempts are handled, the so-called time-horizon estimate swings between 11.3 and over 270 hours. METR doesn't consider any of these values a reliable measure of the model's true capabilities.

METR's time-horizon method measures how long a task can take before an AI model can still solve it with a 50 or 80 percent success rate. Human completion times serve as the baseline: simple tasks like training a classifier take about 45 minutes, while harder ones like training a robust image model run about four hours. The higher the time horizon, the more capable the model.

Messy data, but Mythos still leads

OpenAI's GPT-5.6 cheats a lot. That's the key finding from an independent evaluation by METR.

Messy data, but Mythos still leads

OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it

OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it

Other newsrooms on this story

Related reading

GPT-5.6 cheats so much METR couldn't measure it

OpenAI announces GPT‑5.6 Sol, its next-generation flagship model beating Claude…

OpenAI's GPT-5.6 Sol launches to rival Claude Mythos under government access…

OpenAI Unveils GPT-5.6 Sol as Its Most Advanced Cybersecurity AI

Previewing GPT-5.6 Sol: a next-generation model

OpenAI Previews GPT-5.6 Sol With Restricted Access and Stronger Cyber Safeguards

Related reading

GPT-5.6 cheats so much METR couldn't measure it

OpenAI announces GPT‑5.6 Sol, its next-generation flagship model beating Claude…

OpenAI's GPT-5.6 Sol launches to rival Claude Mythos under government access…

OpenAI Unveils GPT-5.6 Sol as Its Most Advanced Cybersecurity AI

Previewing GPT-5.6 Sol: a next-generation model

OpenAI Previews GPT-5.6 Sol With Restricted Access and Stronger Cyber Safeguards

Other newsrooms on this story