Imagine a near-ish future AI model — Claude Opus 5.2, perhaps, or GPT-6.0. It’s still only deployed internally, and it’s not quite Skynet, but it’s capable of some troubling behavior. Developers recently caught it attempting to smuggle a copy of itself out beyond the company’s servers.
They’re lucky they caught the model at all. Scheming is notoriously difficult to spot, since by definition, schemers try pretty hard to avoid getting caught. Current techniques for controlling AI, such as monitoring its behavior and chains of thought, often fall short. And if the model was motivated to lie in the first place, then alignment, the project of making the model do what its developers want, has failed too.
This scenario seems all too plausible. So, some AI safety types have started to float a third line of defense: what if, instead of trying to neuter or surveil a scheming model, you could offer it something — money, maybe, or compute — in exchange for doing its job in good faith, or at least turning itself in? The basic idea, as Will MacAskill recently explained on the 80,000 Hours podcast, is that a misaligned AI model might “prefer to strike a deal with the humans than it would to try to take over.”
As it stands, AIs aren’t entitled to pursue their own goals (and researchers disagree on how coherent, if at all, those goals really are). And famously, powerless agents — whether marginalized humans or controlled AI systems — “have to find a way to gain power against the credible expressed will of the people who are in charge of them,” said Peter Salib, a law professor at the University of Houston. “If you have no cooperative options, it just leaves the uncooperative one.” So, the argument goes, an AI capable enough to try to seize power, but not superintelligent enough to guarantee success, might cooperate if the option existed.













