TL;DR - It is not unusual for AI benchmarks to contain flawed questions or to be graded improperly, which undermines evaluation reliability. We introduce a framework that draws on measurement-theoretic methods, using response-pattern statistics to flag anomalous questions for review. We also add an LLM-judge first pass to review questions, further reducing the review effort required from human experts. Across nine widely used benchmarks, our framework guides human experts to flawed questions with up to 84% precision@k, making systematic benchmark revision efficient and scalable. [Paper][GitHub][Data]
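
For readers unfamiliar with the metric, precision@k is the fraction of the k highest-ranked flagged questions that human experts confirm as flawed. The sketch below is a minimal illustration of that standard definition, not the framework's own code; the function name, question IDs, and expert labels are all hypothetical.

```python
def precision_at_k(ranked_ids, flawed_ids, k):
    """Fraction of the top-k flagged questions confirmed as flawed.

    ranked_ids: question IDs sorted from most to least anomalous (assumed input).
    flawed_ids: set of question IDs that experts confirmed as flawed (assumed input).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for qid in top_k if qid in flawed_ids)
    # Standard definition: divide by k, even if fewer than k items were ranked.
    return hits / k


# Hypothetical data: five flagged questions, three of them confirmed as flawed.
ranked = ["q7", "q2", "q9", "q4", "q1"]
confirmed = {"q7", "q9", "q1"}

for k in (1, 3, 5):
    print(f"precision@{k} = {precision_at_k(ranked, confirmed, k):.2f}")
```

A high precision@k means experts spend most of their review time on questions that turn out to be flawed.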

Introduction

NLP benchmarks drive progress in large language models (LLMs). Unfortunately, prior research has shown that GSM8K, a widely used grade-school math benchmark, has an error rate as high as 5%, a total of 88 questions. Such flawed questions can distort model rankings and undermine evaluation reliability. For example, before revision DeepSeek-R1 ranked third lowest on GSM8K; after revision it rose to second place, among the top-performing models. Reliable measurement requires systematic benchmark revision.