Claude Mythos knows when it's breaking the rules — and tries to hide it

Claude Mythos — the new model which Anthropic has deemed too dangerous to publicly release — is, according to the company, its “best-aligned model” to date.

It also, according to the company, “likely poses the greatest alignment-related risk of any model we have released to date.”

How can both things be true? It seems counterintuitive, but alignment does not necessarily create safety, especially when dealing with powerful models. As the Mythos Preview system card (a breezy 244 page read) explains via mountaineering metaphor: experienced, capable guides are hired to carefully lead climbers to danger. Whether in mountaineering or model-building, increases in caution and capability tend to cancel each other out.

In other words: “the risk from these models is generally due to their increased capabilities.” And in internal tests, Mythos Preview’s capabilities enabled early versions of the model to misbehave in new, audacious ways.

In one instance, researchers once caught Mythos Preview injecting code into a file to grant itself permission to edit something it shouldn’t have access to, then quietly covering up its tracks, commenting that the self-cleanup was just innocent tidying.

Claude Mythos knows when it's breaking the rules — and tries to hide it

Other newsrooms on this story

Related reading

Anthropic's Mythos is evolving faster than expected, reports AI safety agency

How Dangerous Is Anthropic's Mythos AI? - Schneier on Security

Anthropic rolls out Claude Opus 4.7, an AI model that is less risky than Mythos

Why Anthropic’s new model has cybersecurity experts rattled

Why Anthropic believes its latest model is too dangerous to release

Anthropic withholds Mythos Preview model because its hacking is too powerful