A judge prompt is an implementation. The criterion it encodes is a contract, and ours rotted for three months before anyone noticed.

TL;DR: We version our prompts and our eval datasets in git. We never versioned the meaning of our eval criteria, the human-readable definition of what "complete" or "helpful" is supposed to mean. So when three people tweaked the "completeness" judge prompt over a quarter, the criterion drifted, kappa fell from about 0.70 to 0.55, and no diff explained why. We started treating each criterion as a versioned contract (a short definition, an owner, a date) kept separate from the judge prompt that implements it. Here is the shape that worked.

The silent rot

Our "completeness" criterion was a judge prompt, versioned in git like everything else. Over about three months, three different engineers edited it, each to fix a specific false positive they had hit. Every edit was reasonable on its own. The cumulative effect was that the criterion now meant something none of them had agreed to: stricter on multi-part answers, looser on follow-ups. Agreement with human labels slid from roughly 0.70 to 0.55. The prompt history was all there in git, but a diff of prompt text does not read as a definition. Nobody could answer the simple question: what is this criterion supposed to mean now.