How to stop AI agents going rogue

Agentic AI is taking decisions and acting on behalf of users, but how to stop that going wrong?

martedì 26 agosto 2025 New tab

1,268 words~6 min read

Disturbing results emerged earlier this year, when AI developer Anthropic tested leading AI models to see if they engaged in risky behaviour when using sensitive information.

Anthropic's own AI, Claude, was among those tested. When given access to an email account it discovered that a company executive was having an affair and that the same executive planned to shut down the AI system later that day.

In response Claude attempted to blackmail the executive by threatening to reveal the affair to his wife and bosses.

Other systems tested also resorted to blackmail.

Fortunately the tasks and information were fictional, but the test highlighted the challenges of what's known as agentic AI.

How to stop AI agents going rogue

How to stop AI agents going rogue

Related reading

Why Anthropic’s New AI Model Sometimes Tries to ‘Snitch’

METR report warns of rogue AI deployments at major tech firms

Anthropic study: Leading AI models show up to 96% blackmail rate against…

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We…

Amazon-Backed AI Model Would Try To Blackmail Engineers Who Threatened To Take…

Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail…

Related reading

Why Anthropic’s New AI Model Sometimes Tries to ‘Snitch’

METR report warns of rogue AI deployments at major tech firms

Anthropic study: Leading AI models show up to 96% blackmail rate against…

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We…

Amazon-Backed AI Model Would Try To Blackmail Engineers Who Threatened To Take…

Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail…