Part 3 of 6: Every Agent Passed. The System Failed.

TL;DR: You can test every agent individually and get clean results. Deploy them together and biased...

giovedì 4 giugno 2026 New tab

1,211 words~6 min read

TL;DR: You can test every agent individually and get clean results. Deploy them together and biased conventions emerge by round 15. The bias isn't in any agent — it's in the space between them. Published in Science Advances. Your unit tests won't catch this.

Catch up: Part 1 your judge is biased. Part 2 upgrading made it worse. Part 3: you fixed the judge, tested everything, and shipped. The system had other plans.

Parts 1 and 2 were about one model being biased.

Part 3 is worse.

Part 3 is about a system where no individual model is biased — and the system is biased anyway.

Part 3 of 6: Every Agent Passed. The System Failed.

Part 3 of 6: Every Agent Passed. The System Failed.

Related reading

Part 6 of 6: How to Build Pipelines That Don't Gaslight Themselves.

Part 4 of 6: One Rogue Agent. The Whole Swarm Followed.

Part 5 of 6: The Regulation That Cannot See the Bias It Was Built to Catch.

Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.

Why your agent benchmarks are lying to you

Lesson 1 - TDD with AI: getting tests that hold up when the agent writes them

Related reading

Part 6 of 6: How to Build Pipelines That Don't Gaslight Themselves.

Part 4 of 6: One Rogue Agent. The Whole Swarm Followed.

Part 5 of 6: The Regulation That Cannot See the Bias It Was Built to Catch.

Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.

Why your agent benchmarks are lying to you

Lesson 1 - TDD with AI: getting tests that hold up when the agent writes them