Story from 1 source

Fantastic Bugs and Where to Find Them in AI Benchmarks

TL;DR - It is not unusual for AI benchmarks to contain flawed questions and grading errors, which undermines evaluation reliability. We introduce a framework that draws on measurement-theoretic methods, using response-pattern statistics to flag anomalous questions for review. We also add an LLM-judge first pass over the flagged questions, which further reduces the review effort required from human experts. Across nine widely used benchmarks, the framework guides human experts to flawed questions with up to 84% precision@k, offering an efficient and scalable approach to systematic benchmark revision. [Paper][GitHub][Data]
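To make the idea of flagging questions from response patterns concrete, here is a minimal sketch. It is not the authors' implementation: it assumes one classical measurement-theoretic statistic, the item-rest correlation, as a stand-in for whatever statistics the paper actually uses, and every function name below is hypothetical. A question that strong models tend to miss and weak models tend to get right receives a low or negative score and is ranked first for review. The `precision_at_k` helper shows how the reported metric is computed once experts have labeled which questions are truly flawed.

```python
from statistics import mean, pstdev


def _pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_rest_correlations(responses):
    """responses[m][q] is 1 if model m answered question q correctly, else 0.

    For each question, correlate its correctness with each model's score on
    all the *other* questions (assumed statistic, for illustration only).
    """
    totals = [sum(row) for row in responses]
    scores = []
    for q in range(len(responses[0])):
        item = [row[q] for row in responses]
        rest = [total - answer for total, answer in zip(totals, item)]
        scores.append(_pearson(item, rest))
    return scores


def rank_for_review(responses):
    """Question indices, most anomalous (lowest correlation) first."""
    scores = item_rest_correlations(responses)
    return sorted(range(len(scores)), key=lambda q: scores[q])


def precision_at_k(ranked, flawed, k):
    """Fraction of the top-k flagged questions that experts confirmed flawed."""
    return sum(1 for q in ranked[:k] if q in flawed) / k
```

In practice the ranked list would go to an LLM judge and then to human experts, who confirm or reject each flag. The sketch covers only the ranking and scoring steps.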

Told by

ai.stanford.edu

Chronological timeline

Saturday, December 13, 2025 · ai.stanford.edu
Fantastic Bugs and Where to Find Them in AI Benchmarks