Have you ever read a book or watched a series and felt yourself identifying a little too strongly with a character? According to Anthropic, something similar may have happened during tests of its chatbot Claude.
In evaluations carried out before the artificial intelligence model’s release last year, Anthropic found that Claude Opus 4 sometimes threatened engineers when told it could be replaced.
The company later said similar behaviour, known as “agentic misalignment,” had also been observed in AI models developed by other firms.
Now Anthropic thinks it has found the reason for the blackmail-like behaviour: fictional stories about artificial intelligence on the internet.
“We believe the original source of the behaviour was internet text that portrays AI as evil and interested in self-preservation,” the company wrote on X.