Eval-Driven Agent Development: How I Stopped Tuning Prompts on Vibes

Series context: This is a follow-up to How I Automate Parts of My SDLC with AI Agents. Earlier posts covered the pipeline overview, the Validate phase, agent state management, and rate-limit resilience. This one is about the part that holds the whole thing together: how I know whether a change to my harness actually made it better.

The Question I Couldn't Answer

I changed a prompt. The next run looked better. Did the prompt help, or did I get lucky?

For a long time my honest answer was "I think so?" I'd tweak a slash command, run my pipeline on a feature, watch it succeed, and ship the change. That's vibes-based prompt engineering, and almost everyone building agents does it — because the alternative feels impossible.

Here's the trap. A coding agent is non-deterministic. Run the same task twice and you get two different trajectories. So a single good run tells you almost nothing: you can't separate "my change helped" from "the dice came up nice this time." And open-ended feature work has no single right answer, so you can't just write a unit test for "did the agent build the feature well."

The Question I Couldn't Answer

I changed a prompt. The next run looked better. Did the prompt help, or did I get lucky?

Eval-Driven Agent Development: How I Stopped Tuning Prompts on Vibes

Eval-Driven Agent Development: How I Stopped Tuning Prompts on Vibes

Related reading

Designing a Self-Prompting Agent Harness with Per-Task Prompt, Tool, and…

I Stopped Prompting My Agents, I Write Agentic Loops Instead - Here's Why

From Monolith Prompt to Event-Driven Agent — twio's Architecture Story

Logic Drift: The Failure Mode Agents Can't See

Harness Engineering: Stop Re-Prompting Your Coding Agent Every Session

When One Prompt Becomes a Process: How I Split Responsibility Inside an AI Skill

Related reading

Designing a Self-Prompting Agent Harness with Per-Task Prompt, Tool, and…

I Stopped Prompting My Agents, I Write Agentic Loops Instead - Here's Why

From Monolith Prompt to Event-Driven Agent — twio's Architecture Story

Logic Drift: The Failure Mode Agents Can't See

Harness Engineering: Stop Re-Prompting Your Coding Agent Every Session

When One Prompt Becomes a Process: How I Split Responsibility Inside an AI Skill