TL;DR: You can test every agent individually and get clean results. Deploy them together and biased conventions emerge by round 15. The bias isn't in any agent — it's in the space between them. Published in Science Advances. Your unit tests won't catch this.

Catch up: Part 1 your judge is biased. Part 2 upgrading made it worse. Part 3: you fixed the judge, tested everything, and shipped. The system had other plans.

Parts 1 and 2 were about one model being biased.

Part 3 is worse.

Part 3 is about a system where no individual model is biased — and the system is biased anyway.