Disclosure: I work on one of the tools in this post (create-microservices-app). But the experiment, commands, and outputs below are real, and the pattern at the end works no matter what stack you're on — that's the part I actually want you to take.

If you ship with Claude Code, Cursor, or Codex, you know the feeling. The agent gets you 70% of the way in minutes. It compiles. The diff looks reasonable. You merge it.

And then there's the quiet doubt: did it actually get the hard 30% right — auth boundaries, payments, tenant isolation, the booking logic that stops two people taking the same slot? Because AI doesn't usually write obviously bad code. It writes plausible code. And plausible-but-wrong is the expensive kind — it passes review and breaks in production on day three.

(The data backs the doubt: 84% of devs use AI tools, only 29% trust the output, and 45% of AI-generated apps ship an exploitable vulnerability — Veracode, 2025.)

So I ran an experiment: build a real app with an agent, then deliberately make the mistake an agent makes every day, and see what — if anything — catches it.