Put Your Agent Evals in CI or Stop Calling Them Evals

Most teams I talk to have "evals." I ask them where the evals run. The answer is almost always the same: a notebook, a dashboard, a spreadsheet someone updates after a bad week. That is not an eval suite. That is a museum.

Here is the opinion I will defend for the rest of this post: if your agent's quality checks cannot block a merge, they are decorative. The entire value of an eval is that it stops a regression before it reaches a user. A score you read on Monday about a deploy you shipped Friday is a postmortem, not a gate.

We gate code with unit tests. We gate APIs with contract tests. We gate infra with terraform plan. Then we take the single most non-deterministic component in the stack — an LLM agent that can silently change behavior when a vendor ships a new checkpoint — and we let it through on vibes. That asymmetry is the actual bug.
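To make "block a merge" concrete, here is a minimal sketch of a gate, not a definitive implementation. Everything in it is an assumption of mine for illustration: the `EvalResult` shape, the `gate` function, and the idea of a single pass-rate threshold. The only real mechanism it relies on is that a CI job fails when its process exits non-zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    # Hypothetical shape: one record per eval case the agent was run against.
    case_id: str
    passed: bool


def gate(results: List[EvalResult], min_pass_rate: float) -> int:
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    if not results:
        # Treat an empty run as a failure so a broken harness cannot pass silently.
        print("no eval results produced; failing the gate")
        return 1

    pass_rate = sum(r.passed for r in results) / len(results)
    print(f"pass rate {pass_rate:.2%} (threshold {min_pass_rate:.2%})")

    # List the failing cases so the CI log points at what regressed.
    for r in results:
        if not r.passed:
            print(f"FAILED: {r.case_id}")

    return 0 if pass_rate >= min_pass_rate else 1


# In a CI job the entry point would end with something like:
#     sys.exit(gate(results, min_pass_rate=0.9))
# where `results` comes from actually running the agent on the eval set.
```

The threshold value and how `results` gets produced are yours to decide. What matters is that the number becomes an exit code, and the exit code is what your CI system treats as pass or fail.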

Why "run it locally and eyeball it" rots

The failure isn't that engineers are lazy. It's that manual eval runs degrade under exactly the conditions where you need them most:
