I’ve been running field tests against live open-source issues to study a specific question:

When AI coding agents make repairs, where do they lose the actual truth of the repo?

Not “does the AI write code?”

Not “can it pass a test?”

More specifically: