TL;DR - It is not unusual that AI benchmarks contain flawed questions and are improperly graded, which undermines evaluation reliability. We introduce a framework that draws on measurement-theoretic methods, using response-pattern statistics to flag anomalous questions for review. In addition, we introduce an LLM‑judge first pass to review questions, further reducing the review effort required from human experts. Across nine widely used benchmarks, our framework guides human experts to identify flawed questions with up to 84% precision@k, providing an efficient and scalable framework for systematic benchmark revision. [Paper][GitHub][Data]