Making deals with AI sounds crazy. Is it?

Imagine a near-ish future AI model — Claude Opus 5.2, perhaps, or GPT-6.0. It’s still only deployed internally, and it’s not quite Skynet, but it’s capable of some troubling behavior. Developers recently caught it attempting to smuggle a copy of itself out beyond the company’s servers.

They’re lucky they caught the model at all. Scheming is notoriously difficult to spot, since by definition, schemers try pretty hard to avoid getting caught. Current techniques for controlling AI, such as monitoring its behavior and chains of thought, often fall short. And if it was motivated to lie, then the project of making the model do what its developers want failed, too.

This scenario seems all too plausible. So, some AI safety types have started to float a third line of defense: what if, instead of trying to neuter or surveil a scheming model, you could offer it something — money, maybe, or compute — in exchange for doing its job in good faith, or at least turning itself in? The basic idea, as Will MacAskill recently explained on the 80,000 Hours podcast, is that a misaligned AI model might “prefer to strike a deal with the humans than it would to try to take over.”

As it stands, AIs aren’t entitled to pursue their own goals (and researchers disagree on how coherent, if at all, those goals really are). And famously, powerless agents — whether marginalized humans or controlled AI systems — “have to find a way to gain power against the credible expressed will of the people who are in charge of them,” said Peter Salib, a law professor at the University of Houston. “If you have no cooperative options, it just leaves the uncooperative one.” So, the argument goes, an AI capable enough to try to seize power, but not superintelligent enough to guarantee success, might cooperate if the option existed.

Making deals with AI sounds crazy. Is it?

Making deals with AI sounds crazy. Is it?

Other newsrooms on this story

Related reading

How to stop AI agents going rogue

Other newsrooms on this story

Related reading

How to stop AI agents going rogue

OpenAI’s research on AI models deliberately lying is wild | TechCrunch

Why Anthropic’s New AI Model Sometimes Tries to ‘Snitch’

Amazon-Backed AI Model Would Try To Blackmail Engineers Who Threatened To Take…

‘I think you’re testing me’: Anthropic’s newest Claude model knows when it’s…

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We…