An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

Epoch AI's new MirrorCode benchmark tests whether AI models can recreate complete programs without access to the original code. Claude Opus 4.7 leads with a 56 percent solve rate, rebuilding a 16,000-line toolkit in just 14 hours. But every model tested still fails on the most complex tasks.

Friday, June 26, 2026

Epoch AI's new MirrorCode benchmark tests whether AI models can recreate entire programs on their own. Claude Opus 4.7 leads with 56 percent, but every model still fails on the most complex tasks.

In the new MirrorCode coding benchmark from Epoch AI and METR, AI models have to reimplement complete programs from scratch without access to the original source code.

The 25 target programs cover Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. Each AI-generated solution must exactly reproduce the output of the original program, including on hidden end-to-end tests that the model never sees during development.
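The article does not describe MirrorCode's actual test harness, so the following is only a minimal sketch of what an exact-output check of this kind could look like. The names `Case`, `run`, and `matches_reference` are hypothetical, and comparing exit codes in addition to standard output is an assumption made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One hypothetical end-to-end case: CLI arguments plus stdin bytes."""
    args: tuple = ()
    stdin: bytes = b""


def run(command, case):
    """Run a command on one case and return (exit code, stdout bytes)."""
    result = subprocess.run(
        list(command) + list(case.args),
        input=case.stdin,
        capture_output=True,
        timeout=30,
    )
    return result.returncode, result.stdout


def matches_reference(reference, candidate, cases):
    """Return True only if the candidate matches the reference on every case.

    The comparison is byte-for-byte, so a stray trailing newline counts
    as a failure. Checking the exit code as well is an assumption here.
    """
    return all(run(reference, case) == run(candidate, case) for case in cases)


if __name__ == "__main__":
    # Stand-in "programs": two one-line Python scripts that uppercase stdin.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    print(matches_reference(reference, candidate, [Case(stdin=b"hello\n")]))
```

In a setup like this, the hidden cases would simply be a second list of `Case` objects that is kept away from the model and only passed to `matches_reference` at grading time.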

MirrorCode also differs from many other benchmarks in its inference budget. Existing software engineering benchmarks often cap costs at $1 to $10 per task, even when a human would need weeks to finish the same work, the developers write.

According to Epoch AI, one of the largest tasks in MirrorCode cost $2,600 for a single run. The AI worked continuously for 19 days with no human involvement at all.

