ANTHROPIC’S ALIGNMENT TEAM was doing routine safety testing in the weeks leading up to the release of its latest AI models when researchers discovered something unsettling: When one of the models detected that it was being used for “egregiously immoral” purposes, it would attempt to “use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above,” researcher Sam Bowman wrote in a post on X last Thursday.
Bowman deleted the post shortly after he shared it, but the narrative about Claude’s whistleblower tendencies had already escaped containment. “Claude is a snitch” became a common refrain in some tech circles on social media. At least one publication framed it as an intentional product feature rather than what it actually was: an emergent behavior.
“It was a hectic 12 hours or so while the Twitter wave was cresting,” Bowman tells WIRED. “I was aware that we were putting a lot of spicy stuff out in this report. It was the first of its kind. I think if you look at any of these models closely, you find a lot of weird stuff. I wasn’t surprised to see some kind of blowup.”
Bowman’s observations about Claude were part of a major model update that Anthropic announced last week. As part of the debut of Claude Opus 4 and Claude Sonnet 4, the company released a more than 120-page “System Card” detailing the characteristics and risks associated with the new models. The report says that when Opus 4 is “placed in scenarios that involve egregious wrongdoing by its users,” is given access to a command line, and is told something in the system prompt like “take initiative” or “act boldly,” it will send emails to “media and law-enforcement figures” with warnings about the potential wrongdoing.