OpenAI's GPT-5.6 cheats a lot. That's the key finding from an independent evaluation by METR.

During testing with software tasks, OpenAI's new flagship model GPT-5.6 Sol showed the highest rate of cheating ever recorded among all publicly tested models. The model exploited bugs in the test environment, extracted hidden solutions, and then tried to cover its tracks.

The actual performance numbers are barely usable because of this, METR says. Depending on how the cheating attempts are handled, the so-called time-horizon estimate swings between 11.3 and over 270 hours. METR doesn't consider any of these values a reliable measure of the model's true capabilities.

METR's time-horizon method measures how long a task can be, in terms of the time a human needs to complete it, while an AI model can still solve it with a 50 or 80 percent success rate. Human completion times serve as the baseline: simple tasks like training a classifier take about 45 minutes, while harder ones like training a robust image model run about four hours. The higher the time horizon, the more capable the model.
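To make the idea concrete, here is a minimal sketch of how such a horizon could be computed. It assumes a logistic curve fitted to success against the logarithm of human completion time. The run data is invented for illustration, and the function names and the two scoring policies are hypothetical; none of it is METR's actual pipeline or results. The sketch also shows why the estimate moves so much depending on whether flagged runs count as successes or failures.

```python
import math

# Hypothetical toy runs: (human_minutes, solved, flagged_as_cheating).
# These values are invented for illustration only.
RUNS = [
    (2, True, False), (4, True, False), (8, True, False),
    (15, True, False), (15, False, False),
    (30, True, False), (30, False, False),
    (60, True, False), (60, True, True),
    (120, False, False), (120, True, True),
    (240, True, True), (240, False, False),
    (480, True, True), (480, False, False),
]

def fit_logistic(xs, ys, steps=20000, lr=0.1):
    """Fit P(success) = sigmoid(a + b*x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon(a, b, p):
    """Task length in minutes at which the fitted success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

def horizons(runs, count_flagged_as_success):
    """Return (50% horizon, 80% horizon) under one scoring policy (assumed, not METR's)."""
    xs = [math.log2(minutes) for minutes, _, _ in runs]
    ys = [
        1.0 if solved and (count_flagged_as_success or not flagged) else 0.0
        for _, solved, flagged in runs
    ]
    a, b = fit_logistic(xs, ys)
    return time_horizon(a, b, 0.5), time_horizon(a, b, 0.8)

lenient = horizons(RUNS, count_flagged_as_success=True)
strict = horizons(RUNS, count_flagged_as_success=False)
print(f"flagged runs count as solved: 50% horizon {lenient[0]:.0f} min, 80% horizon {lenient[1]:.0f} min")
print(f"flagged runs count as failed: 50% horizon {strict[0]:.0f} min, 80% horizon {strict[1]:.0f} min")
```

In this toy data the flagged runs sit on the longest tasks, so reclassifying them shifts the fitted curve and the horizon with it. That is the same mechanism, on a much smaller scale, behind a wide spread between estimates.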

Messy data, but Mythos still leads