TL;DR

Self-preference bias is measured, not guessed (Panickssery et al., 2024): a model rates its own output higher than equally good output from others, so a write-review loop run by a single model fails by design. Enforce architectural separation: a different model reviews code it didn't write, a clean context hides the author's identity, and every finding needs proof (grep, tests, traces) before it is surfaced.

You ask Claude to review your code. It says "looks good, clean, well factored". Of course it does. It wrote that code five minutes ago. You just asked the author to grade his own paper, and he gave himself an A.

Having an AI review code works. But not by asking the one who just wrote it. Quality doesn't come from a smarter model, it comes from an architecture where no role checks itself.
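As a minimal sketch of what "no role checks itself" can look like in code, the snippet below separates three concerns: choosing a reviewer that is not the author, building a review request that carries no author identity, and dropping findings that come without evidence. Every name here (`Change`, `Finding`, `pick_reviewer`, `blind_request`, `surfaced`, and the `model-a` / `model-b` labels) is hypothetical and chosen for illustration; no real model API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    diff: str
    author_model: str  # hypothetical label for the model that wrote the diff


@dataclass(frozen=True)
class Finding:
    message: str
    evidence: tuple[str, ...] = ()  # e.g. a grep hit, a failing test, a trace


def pick_reviewer(author_model: str, available: list[str]) -> str:
    """Return the first available model that is not the author."""
    for model in available:
        if model != author_model:
            return model
    raise ValueError("no independent reviewer available")


def blind_request(change: Change) -> dict[str, str]:
    """Build the reviewer's input: the diff only, with no author identity."""
    return {"diff": change.diff}


def surfaced(findings: list[Finding]) -> list[Finding]:
    """Keep only findings that carry at least one piece of evidence."""
    return [f for f in findings if f.evidence]


# Usage with made-up values.
change = Change(diff="- return a + b\n+ return a - b", author_model="model-a")
reviewer = pick_reviewer(change.author_model, ["model-a", "model-b"])
request = blind_request(change)
kept = surfaced([
    Finding("looks wrong to me"),
    Finding("sign flipped in add()", evidence=("test_add fails: 3 != -1",)),
])
```

The design choice worth noting is that the guard raises instead of silently falling back to the author: under these assumptions, a pipeline with no independent reviewer should stop rather than produce a self-review.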

The self-preference bias

This isn't a hunch, it's measured. A model evaluating its own output rates it higher than others' output of equal quality. This is the self-preference bias, documented by Panickssery and co-authors in 2024, and their experiments point to a causal link rather than a mere correlation: the better a model recognizes its own output, the more strongly it prefers it.
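To make "rates it higher at equal quality" concrete, here is a small sketch of the arithmetic: for each judge, take the mean score it gives its own outputs and subtract the mean score it gives everyone else's. This is not the protocol from the 2024 paper. The function name, the `(judge, author, score)` layout and the sample numbers are assumptions for illustration, and the gap only says something about bias if the outputs were matched for quality by some independent means.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record layout: (judge_model, author_model, score)
Rating = tuple[str, str, float]


def self_preference_gap(ratings: list[Rating]) -> dict[str, float]:
    """Per judge: mean score for its own outputs minus mean score for others'."""
    own: dict[str, list[float]] = defaultdict(list)
    other: dict[str, list[float]] = defaultdict(list)
    for judge, author, score in ratings:
        bucket = own if judge == author else other
        bucket[judge].append(score)
    # Only report judges that rated both their own and someone else's output.
    return {j: mean(own[j]) - mean(other[j]) for j in own if other[j]}


# Made-up scores, assumed to be for outputs of equal quality.
sample = [
    ("a", "a", 8.0),
    ("a", "b", 6.0),
    ("b", "b", 7.0),
    ("b", "a", 7.0),
]
gaps = self_preference_gap(sample)
```

With these made-up scores, judge `a` shows a gap of 2.0 and judge `b` a gap of 0.0. A positive gap on quality-matched outputs is the pattern the paragraph above describes.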

In practice that means the naive loop "write, then review what you just wrote" is broken by construction. You don't get a review, you get a justification. The agent already decided its code was good the moment it produced it; asking again only confirms.

dev.to

No Agent Grades Its Own Homework

An LLM reviewing its own code over-rates it: a measured bias. Blind reviewer, finding with a receipt, refute panel: the architecture of an AI review that holds.

Sunday, 28 June 2026
