TL;DR: Most eval sets are sized by "what we had lying around", not by what they can actually detect. If your eval set is 50 traces and you are trying to catch a 5-point drop in pass rate, you are underpowered: the regression hides inside sampling noise more often than not, and you ship it green. A two-line power calculation tells you the size you actually need, and ours said roughly 4x what we were running.

The number nobody computes

We argue about which metric to use and skip the prior question: how big a change can this eval set even see? An eval set has a detection floor, like any experiment. Below it, a real regression and an unlucky sample look identical, so a green run tells you nothing about regressions that small.
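To put a rough number on that floor, here is a minimal sketch. It assumes you compare two independent runs of n traces each (baseline vs. candidate) with a one-sided two-proportion z-test at a 5% significance level, and it assumes a 90% baseline pass rate purely for illustration. The function name `noise_floor` and all of those settings are my assumptions, not part of any particular eval harness.

```python
from math import sqrt
from statistics import NormalDist


def noise_floor(pass_rate: float, n: int, alpha: float = 0.05) -> float:
    """Smallest observed drop in pass rate that clears sampling noise.

    Assumes two independent runs of n pass/fail traces each and a
    one-sided z-test at significance level alpha (normal approximation).
    """
    # Standard error of the difference between two runs when nothing changed.
    se_diff = sqrt(2 * pass_rate * (1 - pass_rate) / n)
    # Critical value: drops smaller than z * se_diff are indistinguishable from noise.
    z = NormalDist().inv_cdf(1 - alpha)
    return z * se_diff


if __name__ == "__main__":
    # Assumed 90% baseline pass rate; swap in your own.
    for n in (50, 100, 200, 400):
        print(f"n={n:4d}  floor = {100 * noise_floor(0.90, n):.1f} points")
```

Under these assumptions, 50 traces puts the floor at roughly 10 points, so an observed 5-point drop sits well inside run-to-run noise. The floor shrinks with the square root of n, which is why halving it takes about four times the traces.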

A two-line power check

For a pass/fail eval, detecting a drop from p1 to p2 at 80% power is a standard two-proportion calculation: