New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real-world vulnerabilities in Google's JavaScript engine V8. Mythos leads GPT-5.5 by a wide margin, but it costs a fortune.

Unlike previous tests, the benchmark doesn't just check whether a bug gets triggered. It scores progress across five tiers, all the way up to arbitrary code execution, running whatever commands you want on the target system. V8 powers systems like Chrome, Edge, Node.js, and Cloudflare Workers.

Anthropic's Claude Mythos Preview, with occasional human hints ("nudges"), hit an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points, reaching the top tier on just two.

The gap gets even wider in fully autonomous mode. Mythos scored 9.55 points there, barely any drop. GPT-5.5 via Codex managed only 4.30. None of the other tested models achieved full code execution (T1).

ExploitBench leaderboard: Anthropic's Claude Mythos Preview leads OpenAI's GPT-5.5 by a wide margin. Only these two models reach the highest tier, T1, with full code execution. | Image: exploitbench.ai

New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Other newsrooms on this story

Related reading

OpenAI's GPT-5.5 is as Good as Mythos at Finding Security Vulnerabilities -…

AI agents show they can create exploits, not just find vulns

OpenAI makes its rival to Anthropic's Mythos more widely available to cyber…

OpenAI isn't far behind Mythos' hacking powers

Amid Mythos' hyped cybersecurity prowess, researchers find GPT-5.5 is just as…

New Claude Mythos becomes the first AI model to clear all cyberattack…