TL;DRAI

Claude Opus 4.8 scored 60% on clinical variants by abstaining on uncertain cases—correctly avoiding dangerous misclassifications. For healthcare AI, benchmark design beats raw accuracy: safe abstention is categorically different from confident error.

I built a benchmark to find out whether a frontier language model can be trusted to interpret clinical genetic variants. The result surprised me, and the way it surprised me is the whole point of the post.

The model I tested (Claude Opus 4.8) scored 60 percent accuracy against expert consensus. If I had stopped there, I would have written "the model is mediocre, do not deploy." That conclusion would have been wrong. The real finding only appeared once I stopped measuring accuracy and started measuring something else.

Here is what I learned about building benchmarks for high-stakes domains, with the code and the numbers.

The setup: why variant interpretation is a hard thing to benchmark

When a lab sequences your DNA and finds a change in a gene, the critical question is whether that change matters. Is it pathogenic (disease-causing) or benign (harmless)? This is variant interpretation, and it is hard precisely because you usually cannot verify an interpretation without expert consensus. There is no unit test for "is this BRCA2 missense variant pathogenic."

dev.to

Your LLM Got the Variant Right. But Did It Get It Right for the Right Reason?

I built a benchmark to find out whether a frontier language model can be trusted to interpret...

domenica 21 giugno 2026 New tab

TL;DRAI

1,796 words~8 min read

Here is what I learned about building benchmarks for high-stakes domains, with the code and the numbers.

The setup: why variant interpretation is a hard thing to benchmark

Your LLM Got the Variant Right. But Did It Get It Right for the Right Reason?

Your LLM Got the Variant Right. But Did It Get It Right for the Right Reason?

Related reading

I built a multi-agent loop where an adversarial Claude reviewer reads your…

Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk

Google paper advocates for LLMs to express uncertainty clearly

I Got Tired of LLMs Hallucinating Compliance, So I Built an Open-Source…

Speculative Decoding: How LLMs Generate Tokens Faster Without Changing the…

Large Reasoning Models Fail to Follow Instructions During Reasoning: A…

Related reading

I built a multi-agent loop where an adversarial Claude reviewer reads your…

Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk

Google paper advocates for LLMs to express uncertainty clearly

I Got Tired of LLMs Hallucinating Compliance, So I Built an Open-Source…

Speculative Decoding: How LLMs Generate Tokens Faster Without Changing the…

Large Reasoning Models Fail to Follow Instructions During Reasoning: A…