Your eval criteria are code. Version them like code.

A judge prompt is an implementation. The criterion it encodes is a contract, and ours rotted for three months before anyone noticed.

TL;DR: We version our prompts and our eval datasets in git. We never versioned the meaning of our eval criteria, the human-readable definition of what "complete" or "helpful" is supposed to mean. So when three people tweaked the "completeness" judge prompt over a quarter, the criterion drifted, kappa fell from about 0.70 to 0.55, and no diff explained why. We started treating each criterion as a versioned contract (a short definition, an owner, a date) kept separate from the judge prompt that implements it. Here is the shape that worked.

The silent rot

Our "completeness" criterion was a judge prompt, versioned in git like everything else. Over about three months, three different engineers edited it, each to fix a specific false positive they had hit. Every edit was reasonable on its own. The cumulative effect was that the criterion now meant something none of them had agreed to: stricter on multi-part answers, looser on follow-ups. Agreement with human labels slid from roughly 0.70 to 0.55. The prompt history was all there in git, but a diff of prompt text does not read as a definition. Nobody could answer the simple question: what is this criterion supposed to mean now.

A judge prompt is an implementation. The criterion it encodes is a contract, and ours rotted for three months before anyone noticed.

The silent rot

Your eval criteria are code. Version them like code.

Other newsrooms on this story

Your eval criteria are code. Version them like code.

Other newsrooms on this story

Related reading

Why I used three different critic roles instead of one (and what the eval…

Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept

Prompts as Code: How to Version, Test, and Ship the Prompt Layer in 2026

Put Your Agent Evals in CI or Stop Calling Them Evals

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts,…

How to Stop Evaluating LLM Outputs by Gut Feel

Related reading

Why I used three different critic roles instead of one (and what the eval…

Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept

Prompts as Code: How to Version, Test, and Ship the Prompt Layer in 2026

Put Your Agent Evals in CI or Stop Calling Them Evals

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts,…

How to Stop Evaluating LLM Outputs by Gut Feel