Fantastic Bugs and Where to Find Them in AI Benchmarks

TL;DR - It is not unusual for AI benchmarks to contain flawed questions or to be graded incorrectly, which undermines evaluation reliability. We introduce a framework that draws on measurement-theoretic methods, using response-pattern statistics to flag anomalous questions for review. We also introduce an LLM-judge first pass over the questions, which further reduces the review effort required from human experts. Across nine widely used benchmarks, our framework guides human experts to flawed questions with up to 84% precision@k, offering an efficient and scalable approach to systematic benchmark revision. [Paper][GitHub][Data]

Saturday, December 13, 2025

Introduction

NLP benchmarks drive progress in large language models (LLMs). Unfortunately, prior research has shown that GSM8K, a widely used grade-school math benchmark, has an error rate as high as 5%, a total of 88 questions. Such flawed questions can distort model rankings and undermine evaluation reliability. For example, DeepSeek-R1 ranked third lowest on GSM8K before revision, but rose to second place after revision. Reliable measurement therefore requires systematic benchmark revision.
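To make the idea of flagging questions from response patterns concrete, here is a minimal sketch. It assumes a binary response matrix (one row per model, one column per question, 1 = graded correct) and uses the item-rest correlation, a classical measurement-theory statistic, as a stand-in; the text above does not specify which statistics the framework uses. The function names and toy data are hypothetical. Questions that stronger models tend to fail and weaker models tend to pass get a low or negative score and are ranked first for review, and precision@k measures how many of the top k flagged questions an expert confirms as flawed.

```python
from statistics import mean, pstdev


def item_rest_correlation(responses, item):
    """Pearson correlation between one question's scores and each
    model's total score on all the other questions."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        return 0.0  # no variance, so the correlation is undefined; treat as neutral
    mean_item, mean_rest = mean(item_scores), mean(rest_scores)
    cov = mean((a - mean_item) * (b - mean_rest)
               for a, b in zip(item_scores, rest_scores))
    return cov / (sd_item * sd_rest)


def rank_suspicious_items(responses):
    """Question indices sorted from most to least suspicious
    (lowest item-rest correlation first)."""
    n_items = len(responses[0])
    return sorted(range(n_items),
                  key=lambda j: item_rest_correlation(responses, j))


def precision_at_k(ranked_items, flawed_items, k):
    """Fraction of the top-k ranked questions that are confirmed flawed."""
    return sum(1 for j in ranked_items[:k] if j in flawed_items) / k


# Toy data (assumed): 5 models x 4 questions, rows ordered strong to weak.
# Question 3 is answered "correctly" only by the weakest models.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
ranked = rank_suspicious_items(responses)
confirmed_flawed = {3}  # what a human expert would confirm on review
print(ranked[0], precision_at_k(ranked, confirmed_flawed, k=1))
```

In practice the ranked list would go to the LLM judge and then to human experts, and precision@k would be computed from their verdicts rather than from a known set.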
