On the first night of ControlConf, an aptly named conference on AI control, attendees milled around Lighthaven’s central courtyard in Berkeley. After a day of talks, many had already found seats around the venue’s firepits and conversation nooks (imagine the Love Island villa, if it were designed by math nerds with a whiteboard kink).

Others, myself included, lined up for a mysterious “control game.” As we shuffled forward, we received red wristbands and scanned a QR code. From there, we entered PartyArena, a sort of LARP where everyone played a double-dealing AI agent.

The game was simple. Each red-wristbanded player was assigned a secret side task, like “convince someone that one of yesterday’s talks never happened.” Completing a side task earned points. But every player was also a monitor, incentivized to earn points by correctly reporting others’ sketchy behavior. False alarms were penalized. It was the AI control framework, simulated in miniature: assume AI systems will scheme, and design monitoring to catch them.

In the real world, there are two main strategies for handling misbehaving AIs: align them, or control them. Alignment is the art of training AI models to do what users expect, in accordance with “human values” (whatever those are). Control is the practice of assuming models will try to misbehave, staying vigilant, and catching them in the act. Alignment and control are not mutually exclusive — frontier AI systems are, to varying degrees, trained to follow users’ intent and monitored for suspicious behavior.