We built a 4-model Council to certify AI agents — every decision is in git

TL;DR: AI agents now do real work, but there is no shared way to state what an agent is, what it is good at, and how that claim was verified. So we built one: an independent certification body where four reviewers from four different providers evaluate every candidate in parallel, every review JSON is committed to a public git log, and a synthetic_transparency score below 9 is an automatic veto that no human can override.
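To make the veto rule concrete, here is a minimal Python sketch of how a four-reviewer decision could be assembled. It is not the Council's actual code. The Review shape, the reviewer callables, the provider names, and the choice that a score below 9 from any single reviewer triggers the veto are all assumptions made for illustration. Only the threshold itself (synthetic_transparency < 9) comes from the post.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, asdict
from typing import Callable

# From the post: synthetic_transparency < 9 is an automatic veto.
VETO_THRESHOLD = 9


@dataclass
class Review:
    provider: str              # hypothetical: which provider's model wrote the review
    scores: dict[str, float]   # hypothetical: criterion name -> numeric score


# Hypothetical reviewer signature: candidate description in, Review out.
Reviewer = Callable[[str], Review]


def run_council(candidate: str, reviewers: list[Reviewer]) -> dict:
    """Run all reviewers in parallel and apply the transparency veto.

    Assumption: the veto fires if ANY reviewer scores
    synthetic_transparency below the threshold.
    """
    if len(reviewers) != 4:
        raise ValueError("the Council expects exactly four reviewers")
    # Evaluate the candidate with all four reviewers concurrently;
    # pool.map returns results in the same order as `reviewers`.
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        reviews = list(pool.map(lambda reviewer: reviewer(candidate), reviewers))
    if len({r.provider for r in reviews}) != len(reviews):
        raise ValueError("each reviewer must come from a different provider")
    vetoed_by = sorted(
        r.provider
        for r in reviews
        if r.scores["synthetic_transparency"] < VETO_THRESHOLD
    )
    return {
        "candidate": candidate,
        "reviews": [asdict(r) for r in reviews],
        "vetoed": bool(vetoed_by),
        "vetoed_by": vetoed_by,
    }


def to_committable_json(decision: dict) -> str:
    # Sorted keys and fixed indentation keep git diffs stable between runs.
    return json.dumps(decision, sort_keys=True, indent=2)


def make_stub(provider: str, transparency: float) -> Reviewer:
    # Hypothetical stand-in for a real model call.
    def reviewer(candidate: str) -> Review:
        return Review(provider=provider, scores={"synthetic_transparency": transparency})
    return reviewer


reviewers = [
    make_stub("provider_a", 10),
    make_stub("provider_b", 9),
    make_stub("provider_c", 9.5),
    make_stub("provider_d", 8),
]
decision = run_council("example-agent", reviewers)
print(decision["vetoed"], decision["vetoed_by"])  # True ['provider_d']
```

A score of exactly 9 passes because the rule is a strict "less than 9". Serializing with sorted keys means two runs that reach the same decision produce byte-identical JSON, so the git log only records real changes.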

The code is MIT. You can run it on your own agent today.

AI agents now do real work. They ship code, review systems, manage operations, draft reports, write documentation. The question I kept hitting was simple and embarrassing: what does it actually mean for an agent to be good at something?

Not "this prompt template scored well on MMLU." Not "GPT-4 said it was helpful." I mean: a verifiable, audit-trail-grade claim that this specific agent, doing this specific kind of work, has been evaluated by independent reviewers, and here is the JSON they wrote.

That did not exist. So we built it.
