AI doesn't write bad code. It writes plausible code — so I tried to break my own AI-built app

Disclosure: I work on one of the tools in this post (create-microservices-app). But the experiment, commands, and outputs below are real, and the pattern at the end works no matter what stack you're on — that's the part I actually want you to take.

If you ship with Claude Code, Cursor, or Codex, you know the feeling. The agent gets you 70% of the way in minutes. It compiles. The diff looks reasonable. You merge it.

And then there's the quiet doubt: did it actually get the hard 30% right — auth boundaries, payments, tenant isolation, the booking logic that stops two people taking the same slot? Because AI doesn't usually write obviously bad code. It writes plausible code. And plausible-but-wrong is the expensive kind — it passes review and breaks in production on day three.

(The data backs the doubt: 84% of devs use AI tools, only 29% trust the output, and 45% of AI-generated apps ship an exploitable vulnerability — Veracode, 2025.)

So I ran an experiment: build a real app with an agent, then deliberately make the mistake an agent makes every day, and see what — if anything — catches it.

