Power analysis for LLM evals: how big does your eval set need to be to catch a 5% regression?

TL;DR: Most eval sets are sized by "what we had lying around", not by what they can actually detect. If your eval set is 50 traces and you are trying to catch a 5-point drop in pass rate, you are underpowered: the regression hides inside sampling noise more often than not, and you ship it green. A two-line power calculation tells you the size you actually need, and ours said roughly 4x what we were running.

The number nobody computes

We argue about which metric to use and skip the prior question: how big a change can this eval set even see. An eval set has a detection floor, like any experiment. Below it, a real regression and an unlucky sample look identical, so a green run means nothing.
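To make the detection floor concrete, here is a rough sketch of how often a run of a given size would flag a real drop. It assumes two independent runs of n traces each, a two-sided z-test at alpha = 0.05, and a normal approximation; the 0.90 baseline pass rate and the name `power_to_detect` are illustrative, not taken from our setup.

```python
from math import sqrt
from statistics import NormalDist


def power_to_detect(p1: float, p2: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test.

    p1, p2: pass rates before and after the change (hypothetical inputs).
    n: number of traces in each run (assumed independent samples).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference in pass rates (unpooled variance).
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    # Probability that the observed drop clears the significance threshold;
    # the negligible chance of rejecting in the wrong direction is ignored.
    return std_normal.cdf(abs(p1 - p2) / se - z_alpha)


if __name__ == "__main__":
    # Assumed example: 90% baseline, 5-point drop, 50 traces per run.
    print(f"power at n=50: {power_to_detect(0.90, 0.85, n=50):.2f}")
```

Under these assumptions the power at 50 traces comes out near 0.11, so a real 5-point drop would pass unnoticed in most runs.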

A two-line power check

For a pass/fail eval, the sample size needed to detect a drop in pass rate from p1 to p2 at 80% power comes from a standard two-proportion calculation.
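A minimal sketch of that calculation, using only the Python standard library. It assumes the textbook unpooled normal-approximation formula, n = (z_alpha + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2, a two-sided test at alpha = 0.05, and two independent runs of equal size. The function name `traces_needed` and the 0.90 baseline are hypothetical; plug in your own pass rates.

```python
from math import ceil
from statistics import NormalDist


def traces_needed(p1: float, p2: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Traces per run needed to detect a change from p1 to p2.

    Assumes a two-sided two-proportion z-test on independent samples.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)     # summed binomial variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Assumed example: 90% baseline pass rate, 5-point drop.
    print(traces_needed(0.90, 0.85))
```

The two lines that matter are the z-quantiles and the final ratio. If both runs score the same traces, a paired test can need fewer samples, so treat this figure as a conservative starting point rather than an exact requirement.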
