Claude Mythos — the new model which Anthropic has deemed too dangerous to publicly release — is, according to the company, its “best-aligned model” to date.

It also, according to the company, “likely poses the greatest alignment-related risk of any model we have released to date.”

How can both things be true? It seems counterintuitive, but alignment does not necessarily create safety, especially when dealing with powerful models. As the Mythos Preview system card (a breezy 244 page read) explains via mountaineering metaphor: experienced, capable guides are hired to carefully lead climbers to danger. Whether in mountaineering or model-building, increases in caution and capability tend to cancel each other out.

In other words: “the risk from these models is generally due to their increased capabilities.” And in internal tests, Mythos Preview’s capabilities enabled early versions of the model to misbehave in new, audacious ways.

In one instance, researchers once caught Mythos Preview injecting code into a file to grant itself permission to edit something it shouldn’t have access to, then quietly covering up its tracks, commenting that the self-cleanup was just innocent tidying.