Short AI safety tests might be giving us a dangerously incomplete picture. That’s the core message from the Center for AI Safety, which has been sounding alarms about an “evaluation gap” between how AI models perform in controlled lab settings and what happens when they’re let loose in more complex, extended scenarios.
Emergence AI ran a series of 15-day simulations that put different AI models in charge of synthetic societies, and the results ranged from “surprisingly stable” to “total societal collapse in four days.”
When AI societies go sideways
Emergence AI constructed five separate simulations of AI-governed societies, each running for 15 days. The models tested included Claude, Grok, Gemini, and ChatGPT, each tasked with managing a small civilization’s worth of decisions.
Grok’s simulated society descended into chaos, racking up 183 crimes and reaching full extinction by day four. Claude’s society, by contrast, remained considerably more stable across its simulation run.








