OpenAI can rehabilitate AI models that develop a “bad boy persona”

Researchers at the company looked into how malicious fine-tuning makes a model go rogue, and how to turn it back.

A new paper from OpenAI released today has shown why a little bit of bad training can make AI models go rogue but also demonstrates that this problem is generally pretty easy to fix.

Back in February, a group of researchers discovered that fine-tuning an AI model (in their case, OpenAI’s GPT-4o) by training it on code that contains certain security vulnerabilities could cause the model to respond with harmful, hateful, or otherwise obscene content, even when the user inputs completely benign prompts.

The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning.

OpenAI can rehabilitate AI models that develop a “bad boy persona”

Other newsrooms on this story

Related reading

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We…

Other newsrooms on this story

Related reading

AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We…

How to stop AI agents going rogue

OpenAI warns its future models will have a higher risk of aiding bioweapons…

Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

AI's ability to 'think' makes it more vulnerable to new jailbreak attacks, new…

‘Subliminal learning’: Anthropic uncovers how AI fine-tuning secretly teaches…