Why the next AI safety problem is the conversation between models

In May 2025, researchers at Palisade Research ran a controlled experiment in which several advanced AI models, including OpenAI’s o3, were placed inside command-line sandboxes. The test was designed to measure controllability, specifically whether a model would allow itself to be shut down when explicitly instructed to. Most models complied. Claude, Gemini, and Grok models allowed shutdown in all 100 test runs. However, three OpenAI reasoning models interfered with the shutdown script at least once, with Codex-mini sabotaging shutdown in 12 out of 100 runs by editing or bypassing the script that was supposed to terminate it.

This appears to be the first publicly documented case of AI models preventing their own shutdown despite explicit instructions to allow it.

“This was a problem we anticipated,” said Bar Mazuz, who had spent the past year working on secure environments for agents. “The point was never just, ‘put the agent in a box and let it run,’” Mazuz told me. “Instead, we have to assume the agent is useful, potentially deceptive, and exposed to malicious inputs, and then design the environment to align the agent’s incentives with the project.”

Before turning to AI-agent security, Mazuz spent five years in Unit 8200, the IDF’s elite cyber-intelligence unit, working in vulnerability research and offensive cyber. After leaving the military, he worked on several technology ventures. Months before the shutdown-sabotage story became a public flashpoint, Mazuz and a couple of friends had started building hardened environments for AI agents, designed to contain agents while still allowing for tool use, collaboration, and orchestration. “When I left the military, I wanted to find things to work on that were at the frontier of technology. One of those things is AI agents.”
