In machine learning, reproducibility measures how easy it is to repeat the same experiments (using the same code, data/distribution, and settings) and get the same results. A high level of reproducibility enables trust between teams and allows them to build on each other’s progress.

The challenge with reproducibility is that ground truth data usually relies on humans, and humans, unlike machines, approach problems from a variety of perspectives and often disagree on the result. Surprisingly little research has studied the impact of ignoring this disagreement, a common oversight in AI benchmarking. One reason is that budgets for collecting human-backed evaluation data are limited, and collecting ratings from multiple raters for each example greatly increases per-item annotation costs.

In “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation,” we investigate the reproducibility trade-off between the number of items being rated (N) and the number of human raters per item (K). Is it better to have fewer raters for many items or many raters for fewer items? Think of this as a choice between breadth and depth. The breadth (forest) approach asks 1,000 different people to each try one meal at a restaurant to get an overall sense of quality. The depth (tree) approach asks 20 people to try the same 50 meals, revealing more about specific dishes, which might influence the overall rating.

Historically, AI evaluation has leaned toward the forest approach. Most researchers settle for 1 to 5 raters per item, assuming this is enough to find a single "correct" truth. Our research suggests this standard is often insufficient to capture natural disagreement, and we provide a roadmap for building more reliable and cost-efficient AI benchmarks.
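To make the breadth-versus-depth arithmetic concrete: both restaurant designs above collect the same 1,000 ratings (1,000 people × 1 meal, or 20 people × 50 meals), so the real choice is how to split a fixed budget. The sketch below is a minimal illustration, not the paper's method. It assumes the budget is simply the total number of ratings (items × raters per item) and that every rating costs the same; the function name `allocations` and its parameters are hypothetical.

```python
def allocations(budget, raters_per_item_options):
    """List (n_items, k_raters) splits that fit a fixed rating budget.

    Assumption (illustrative only): total cost = n_items * k_raters,
    with every individual rating costing the same.
    """
    pairs = []
    for k in raters_per_item_options:
        if k <= 0:
            raise ValueError("raters per item must be positive")
        n = budget // k  # largest number of items affordable at k raters each
        if n > 0:
            pairs.append((n, k))
    return pairs


if __name__ == "__main__":
    # Same 1,000-rating budget, from "forest" (k=1) to "tree" (k=20).
    for n, k in allocations(1000, [1, 5, 20]):
        print(f"{n} items x {k} raters per item = {n * k} ratings")
```

Under this flat-cost assumption, every added rater per item is paid for with fewer items, which is why the number of raters cannot be raised freely without shrinking the benchmark.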