TL;DR: AI agents now do real work, but there is no shared way to say what an agent is, what it is good at, and how that claim was checked. So we built one: an independent certification body where every candidate is evaluated in parallel by four reviewers from four different providers, every review JSON is committed to a public git history, and synthetic_transparency < 9 is an automatic veto no human can override.

The code is MIT. You can run it on your own agent today.
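To make the veto rule concrete, here is a minimal sketch of how it could be checked. It is not the project's actual code. The `Review` type, the `provider` field, and both function names are hypothetical, and it assumes that one reviewer scoring below the threshold is enough to trigger the veto. Only the `synthetic_transparency` name, the threshold of 9, and the four-reviewers-from-four-providers rule come from the description above.

```python
from dataclasses import dataclass

# From the post: synthetic_transparency < 9 is an automatic veto.
VETO_THRESHOLD = 9


@dataclass(frozen=True)
class Review:
    provider: str                # hypothetical field: who ran this reviewer
    synthetic_transparency: int  # score name from the post; scale is assumed


def has_independent_panel(reviews: list[Review], size: int = 4) -> bool:
    """True when there are exactly `size` reviews, each from a distinct provider."""
    return len(reviews) == size and len({r.provider for r in reviews}) == size


def is_vetoed(reviews: list[Review]) -> bool:
    """Assumption: any single score below the threshold vetoes the candidate."""
    return any(r.synthetic_transparency < VETO_THRESHOLD for r in reviews)


# Illustrative data only, not real evaluation results.
panel = [
    Review("provider_a", 10),
    Review("provider_b", 9),
    Review("provider_c", 9),
    Review("provider_d", 8),
]
vetoed = is_vetoed(panel)
```

In this made-up panel the fourth reviewer scores 8, so `vetoed` is true even though the other three pass. The check has no override parameter, which is the point of the rule.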

AI agents now do real work. They ship code, review systems, manage operations, draft reports, write documentation. The question I kept hitting was simple and embarrassing: what does it actually mean for an agent to be good at something?

Not "this prompt template scored well on MMLU." Not "GPT-4 said it was helpful." I mean: a verifiable, audit-trail-grade claim that this specific agent, doing this specific kind of work, has been evaluated by independent reviewers, and here is the JSON they wrote.

That did not exist. So we built it.