I built a benchmark to find out whether a frontier language model can be trusted to interpret clinical genetic variants. The result surprised me, and the way it surprised me is the whole point of the post.
The model I tested (Claude Opus 4.8) scored 60 percent accuracy against expert consensus. If I had stopped there, I would have written "the model is mediocre, do not deploy." That conclusion would have been wrong. The real finding only appeared once I stopped measuring accuracy and started measuring something else.
Here is what I learned about building benchmarks for high-stakes domains, with the code and the numbers.
The setup: why variant interpretation is a hard thing to benchmark
When a lab sequences your DNA and finds a change in a gene, the critical question is whether that change matters. Is it pathogenic (disease-causing) or benign (harmless)? This is variant interpretation, and it is hard precisely because you usually cannot verify an interpretation without expert consensus. There is no unit test for "is this BRCA2 missense variant pathogenic."






