What I learned roleplaying as a rogue AI

On the first night of ControlConf, an aptly named conference on AI control, attendees milled around Lighthaven’s central courtyard in Berkeley. After a day of talks, many had already found seats around the venue’s firepits and conversation nooks (imagine the Love Island villa, if it were designed by math nerds with a whiteboard kink).

Others, myself included, lined up for a mysterious “control game.” As we shuffled forward, we received red wristbands and scanned a QR code. From there, we entered PartyArena, a sort of LARP where everyone played a double-dealing AI agent.

The game was simple. Each red-wristbanded player was assigned a secret side task, like “convince someone that one of yesterday’s talks never happened.” Completing a side task earned points. But every player was also a monitor, incentivized to earn points by correctly reporting others’ sketchy behavior. False alarms were penalized. It was the AI control framework, simulated in miniature: assume AI systems will scheme, and design AI systems to catch them.

In the real world, there are two main strategies for handling misbehaving AIs: align them, or control them. Alignment is the art of training AI models to do what users expect, in accordance with “human values” (whatever those are). Control is the practice of assuming models will try to misbehave, staying vigilant, and catching them in the act. Alignment and control are not mutually exclusive — frontier AI systems are, to varying degrees, trained to follow users’ intent and monitored for suspicious behavior.
