We stopped writing eval cases by hand. Now every prod incident becomes one.

TL;DR: Hand-written eval cases test the failures you already imagined, which are never the ones that page you. The best eval cases we have did not come from a brainstorm; they came from production incidents. We wired the postmortem process to emit an eval case automatically, and our eval set started catching the next variant of last month's outage instead of the bugs we were already not making.

Hand-written eval sets have a blind spot shaped like your imagination

When you sit down to write eval cases, you write the failures you can think of, and by definition those are the ones you already defend against. The failure that takes down prod is the one nobody pictured, and it is not in your hand-written set, because if you had pictured it you would have fixed it. So a green eval run mostly tells you that you are still not making the mistakes you already knew about.

Mining cases from incidents

Every prod incident is a labeled example handed to you for free: an input, a wrong output, and a human who already decided it was wrong. We changed the postmortem template to capture the exact input envelope (prompt, retrieved context, tool outputs, model and params) and the corrected expected behavior, and a small script drops that into the eval set as a permanent case.
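
A minimal sketch of what such a script could look like, in Python. Everything named here is an assumption for illustration: the `IncidentCase` fields, the JSONL file layout, and the choice to skip an incident whose input envelope is already in the set. It mirrors the envelope described above (prompt, retrieved context, tool outputs, model and params) plus the corrected expected behavior.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any


@dataclass
class IncidentCase:
    """One eval case mined from a postmortem (field names are hypothetical)."""
    incident_id: str
    prompt: str
    retrieved_context: list[str]
    tool_outputs: list[dict[str, Any]]
    model: str
    params: dict[str, Any]
    observed_output: str    # the wrong output seen in prod
    expected_behavior: str  # the correction agreed in the postmortem

    def case_id(self) -> str:
        # Hash only the input envelope, so filing the same incident twice
        # maps to the same case id.
        envelope = {
            "prompt": self.prompt,
            "retrieved_context": self.retrieved_context,
            "tool_outputs": self.tool_outputs,
            "model": self.model,
            "params": self.params,
        }
        blob = json.dumps(envelope, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:16]


def add_case(case: IncidentCase, eval_set: Path) -> bool:
    """Append the case to a JSONL eval set; return False if already present."""
    cid = case.case_id()
    if eval_set.exists():
        with eval_set.open(encoding="utf-8") as f:
            for line in f:
                if line.strip() and json.loads(line)["case_id"] == cid:
                    return False
    record = {"case_id": cid, **asdict(case)}
    with eval_set.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return True
```

Storing the observed wrong output next to the expected behavior is a design choice in this sketch: it lets a later grader check both that the model does the right thing and that it no longer produces the specific output that caused the incident.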
