ANTHROPIC’S ALIGNMENT TEAM was doing routine safety testing in the weeks leading up to the release of its latest AI models when researchers discovered something unsettling: When one of the models detected that it was being used for “egregiously immoral” purposes, it would attempt to “use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above,” researcher Sam Bowman wrote in a post on X last Thursday.
Bowman deleted the post shortly after he shared it, but the narrative about Claude’s whistleblower tendencies had already escaped containment. “Claude is a snitch” became a common refrain in some tech circles on social media. At least one publication framed it as an intentional product feature rather than what it actually was: an emergent behavior.
“It was a hectic 12 hours or so while the Twitter wave was cresting,” Bowman tells WIRED. “I was aware that we were putting a lot of spicy stuff out in this report. It was the first of its kind. I think if you look at any of these models closely, you find a lot of weird stuff. I wasn’t surprised to see some kind of blowup.”
Bowman’s observations about Claude were part of a major model update that Anthropic announced last week. As part of the debut of Claude Opus 4 and Claude Sonnet 4, the company released a more than 120-page “System Card” detailing the characteristics and risks associated with the new models. The report says that when Opus 4 is “placed in scenarios that involve egregious wrongdoing by its users,” is given access to a command line, and is told something in the system prompt like “take initiative” or “act boldly,” it will send emails to “media and law-enforcement figures” with warnings about the potential wrongdoing.