A coding agent runs cleanup_old_records() against what it thinks is a staging database. It isn't. The connection string came from an environment variable that got overwritten three deploys ago, and the agent just deleted four months of customer orders. It did exactly what it was told. It just had its hands on the wrong system.

That failure isn't an argument against agents. It's an argument against the thing almost everyone skips: a place for the agent to be wrong cheaply. You wouldn't hand a new engineer production database credentials on their first morning and walk away. You'd give them a staging environment, a read-only replica, a code review gate, and a few weeks of supervised work. An agent deserves exactly the same onboarding, except it can take a thousand actions a minute, so the cost of skipping the playground is a thousand times higher.

This is about how to build that playground. Not a vibes-based "we tested it a bit" demo, but a real staging ground where the agent runs its full loop against fakes, where you can make tools fail on purpose, where you replay the same task until you trust the consistency, and where production access is something the agent earns rather than gets by default.