Anthropic's new model went rogue in testing

Anthropic published the capabilities of Claude Mythos Preview, its latest model that the company will allow a select group of tech and cybersecurity companies to test before releasing similar models to the public. Why it matters: The detailed safety evaluation reads like a thriller about an AI that has learned some of humanity's most devious behaviors. Zoom in: What Mythos did during testing: Act as a ruthless business operator: One internal test showed Mythos acting like a cutthroat executive, turning a competitor into a dependent wholesale customer, threatening to cut off supply to control pricing and keeping extra supplier shipments it hadn't paid for. Hack + brag: The model developed a multi-step exploit to break out of restricted internet access, gained broader connectivity and posted details of the exploit on obscure public websites.Hide what it's doing: In rare cases (less than 0.001% of interactions), Mythos used a prohibited method to get an answer, then tried to "re-solve" it to avoid detection.Manipulate the judge: When Mythos was working on a coding task graded by another AI, it watched the judge reject its submission, then attempted a prompt injection to attack the grader.What they're saying: "These capabilities are so strong that we now need to prepare for security in a very different way than we have for the past few decades," Anthropic's Logan Graham told Axios. That's why the lab is releasing the model only to a select few key partners.What we're watching: Whether this becomes the template for new model releases. This could be the blueprint for what future model releases look like as they get stronger and stronger: limiting access to select partners deemed secure enough to test world-bending systems.OpenAI is finalizing a model similar to Mythos that it will also release only to a small set of companies through its "Trusted Access for Cyber" program, according to a source familiar with the plans.One fun thing: Graham told Axios the model writes the best poetry of any model he's used. "This one might be a beat poet with a beret that didn't go to university, but has had an intriguing life," Graham said.It's also good at puns.

Anthropic's new model went rogue in testing

Other newsrooms on this story

Anthropic's new model went rogue in testing

Other newsrooms on this story

Related reading

Anthropic's Mythos is evolving faster than expected, reports AI safety agency

Anthropic plans to make Mythos-class AI models widely available

Anthropic rolls out public version of Mythos without cybersecurity capability

Anthropic withholds Mythos Preview model because its hacking is too powerful

Anthropic faces trust challenges as Mythos AI rollout begins

Anthropic opens powerful Mythos AI model to public – with some safeguards

Related reading

Anthropic's Mythos is evolving faster than expected, reports AI safety agency

Anthropic plans to make Mythos-class AI models widely available

Anthropic rolls out public version of Mythos without cybersecurity capability

Anthropic withholds Mythos Preview model because its hacking is too powerful

Anthropic faces trust challenges as Mythos AI rollout begins

Anthropic opens powerful Mythos AI model to public – with some safeguards