Scoring AI Agents: Deterministic Metrics + an LLM Judge

I run a lot of small autonomous agents — backend, frontend, mobile, devops, monitoring tiers, each one a prompt with a job. The moment you have more than a handful, a question gets uncomfortable: are they actually any good, and did my last prompt edit make them better or worse? "It looked fine when I tried it" doesn't scale. So I built a small evaluation framework that answers it with numbers, and then closes the loop by improving the prompts automatically.

Here's how it's put together.

Deterministic first, LLM second

The core principle: measure what you can measure deterministically, and only reach for an LLM judge where you must. Deterministic metrics are free, instant, and reproducible. An LLM judge is none of those things — so it's opt-in and purely additive.
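
To make "opt-in and purely additive" concrete, here is a minimal sketch of what such a score record could look like. Everything in it is an assumption for illustration, not the framework's actual code: the `Score` dataclass, the two metrics (`exact_match`, `required_terms`), and the `judge` callable are all hypothetical names. The point is only the shape: deterministic metrics are always computed, and the judge fills an extra field without touching them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Score:
    # Always present: cheap, reproducible checks.
    deterministic: dict[str, float]
    # Stays None unless a judge was explicitly passed in.
    judge: Optional[float] = None


def exact_match(output: str, expected: str) -> float:
    """1.0 if the output equals the expected text, ignoring outer whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def required_terms(output: str, terms: list[str]) -> float:
    """Fraction of required terms found in the output (case-insensitive)."""
    if not terms:
        return 1.0
    lowered = output.lower()
    return sum(1 for t in terms if t.lower() in lowered) / len(terms)


def score(
    output: str,
    expected: str,
    terms: list[str],
    judge: Optional[Callable[[str], float]] = None,  # hypothetical LLM judge hook
) -> Score:
    deterministic = {
        "exact_match": exact_match(output, expected),
        "required_terms": required_terms(output, terms),
    }
    # The judge only adds a field; it never alters the deterministic metrics.
    judge_score = judge(output) if judge is not None else None
    return Score(deterministic=deterministic, judge=judge_score)
```

Calling `score(output, expected, terms)` with no judge costs nothing beyond string comparisons, so it can run on every prompt edit. Passing a `judge` callable is a deliberate choice made per run.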

The harness runs each agent as an isolated subprocess, feeds it a fixed fixture on stdin, captures stdout, and scores the result against expected outputs. No shared state, no network, no flakiness.
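
The subprocess loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness itself: `run_agent` and `score_exact` are hypothetical names, the 30-second timeout is arbitrary, and the "agent" is a stand-in one-liner that uppercases its input. Note that a plain subprocess gives process isolation only; blocking network access would need something extra (a sandbox or container), which this sketch does not attempt.

```python
import subprocess
import sys


def run_agent(cmd: list[str], fixture: str, timeout: float = 30.0) -> str:
    """Run one agent as a child process: fixture in on stdin, stdout back out."""
    result = subprocess.run(
        cmd,
        input=fixture,        # the fixed fixture, fed on stdin
        capture_output=True,  # capture stdout and stderr
        text=True,            # work with str rather than bytes
        timeout=timeout,      # a hung agent raises TimeoutExpired
        check=True,           # a non-zero exit raises CalledProcessError
    )
    return result.stdout


def score_exact(actual: str, expected: str) -> float:
    """Simplest possible deterministic score: exact match after trimming."""
    return 1.0 if actual.strip() == expected.strip() else 0.0


if __name__ == "__main__":
    # Stand-in agent (assumption): a Python one-liner that uppercases stdin.
    fake_agent = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
    output = run_agent(fake_agent, "hello")
    print(score_exact(output, "HELLO"))
```

Because every run starts a fresh process with the same fixture, two runs can only differ if the agent itself is nondeterministic, which is exactly the thing worth finding out.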
