TL;DR: Hand-written eval cases test the failures you already imagined, which are never the ones that page you. Our best eval cases did not come from a brainstorm; they came from production incidents. We wired the postmortem process to emit an eval case automatically, and our eval set started catching the next variant of last month's outage instead of re-checking bugs we had already stopped making.

Hand-written eval sets have a blind spot shaped like your imagination

When you sit down to write eval cases, you write the failures you can think of, and by definition those are the ones you already defend against. The failure that takes down prod is the one nobody pictured, and it is not in your hand-written set, because if you had pictured it you would have fixed it. So a green eval run mostly tells you that you are still not making the mistakes you already knew about.

Mining cases from incidents

Every prod incident is a labeled example handed to you for free: an input, a wrong output, and a human who already decided it was wrong. We changed the postmortem template to capture the exact input envelope (prompt, retrieved context, tool outputs, model and params) and the corrected expected behavior, and a small script drops that into the eval set as a permanent case.
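As a rough illustration of what such a script might look like, here is a minimal sketch. It is not our actual tooling: the `IncidentCase` fields, the `append_case` helper, the JSONL layout, and the hash-based deduplication are all assumptions made for the example. The only parts taken from the template described above are the envelope contents (prompt, retrieved context, tool outputs, model and params) and the corrected expected behavior.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any


@dataclass
class IncidentCase:
    """One postmortem, reduced to what an eval harness needs (hypothetical schema)."""
    incident_id: str
    prompt: str
    retrieved_context: list[str]
    tool_outputs: list[dict[str, Any]]
    model: str
    params: dict[str, Any]
    observed_output: str     # the wrong output that caused the incident
    expected_behavior: str   # what the postmortem decided should have happened

    def case_id(self) -> str:
        # Hash only the input envelope, so re-filing the same incident
        # with a reworded expectation does not create a second case.
        envelope = json.dumps(
            [self.prompt, self.retrieved_context, self.tool_outputs,
             self.model, self.params],
            sort_keys=True,
        )
        return hashlib.sha256(envelope.encode("utf-8")).hexdigest()[:16]


def append_case(case: IncidentCase, eval_path: Path) -> bool:
    """Append the case to a JSONL eval set. Returns False if it is already there."""
    cid = case.case_id()
    if eval_path.exists():
        with eval_path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip() and json.loads(line)["case_id"] == cid:
                    return False
    record = {"case_id": cid, "source": "postmortem", **asdict(case)}
    with eval_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return True
```

Storing the observed wrong output next to the expected behavior is a design choice in this sketch: it lets a grader check both that the model now does the right thing and that it no longer reproduces the original failure. The file is append-only, which matches the idea of a permanent case.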