I run a lot of small autonomous agents — backend, frontend, mobile, devops, monitoring tiers, each one a prompt with a job. The moment you have more than a handful, a question gets uncomfortable: are they actually any good, and did my last prompt edit make them better or worse? "It looked fine when I tried it" doesn't scale. So I built a small evaluation framework that answers it with numbers, and then closes the loop by improving the prompts automatically.

Here's how it's put together.

Deterministic first, LLM second

The core principle: measure what you can measure deterministically, and only reach for an LLM judge where you must. Deterministic metrics are free, instant, and reproducible. An LLM judge is none of those things — so it's opt-in and purely additive.

The harness runs each agent as an isolated subprocess, feeds it a fixed fixture on stdin, captures stdout, and scores the result against expected outputs. No shared state, no network, no flakiness.