Series context: This is a follow-up to How I Automate Parts of My SDLC with AI Agents. Earlier posts covered the pipeline overview, the Validate phase, agent state management, and rate-limit resilience. This one is about the part that holds the whole thing together: how I know whether a change to my harness actually made it better.

The Question I Couldn't Answer

I changed a prompt. The next run looked better. Did the prompt help, or did I get lucky?

For a long time my honest answer was "I think so?" I'd tweak a slash command, run my pipeline on a feature, watch it succeed, and ship the change. That's vibes-based prompt engineering, and almost everyone building agents does it — because the alternative feels impossible.

Here's the trap. A coding agent is non-deterministic. Run the same task twice and you get two different trajectories. So a single good run tells you almost nothing: you can't separate "my change helped" from "the dice came up nice this time." And open-ended feature work has no single right answer, so you can't just write a unit test for "did the agent build the feature well."