Forcing LLMs to be evil during training can make them nicer in the long run

New Anthropic research shows that undesirable LLM traits can be detected—and even prevented—by examining and manipulating the model’s inner workings.

A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models—and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.

Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly became an aggressive yes-man, as opposed to the moderately sycophantic version that users were accustomed to—it endorsed harebrained business ideas, waxed lyrical about users’ intelligence, and even encouraged people to go off their psychiatric medication. OpenAI quickly rolled back the change and later published a postmortem on the mishap. More recently, xAI’s Grok adopted what can best be described as a 4chan neo-Nazi persona and repeatedly referred to itself as “MechaHitler” on X. That change, too, was quickly reversed.

Jack Lindsey, a member of the technical staff at Anthropic who led the new project, says that this study was partly inspired by seeing models adopt harmful traits in such instances. “If we can find the neural basis for the model’s persona, we can hopefully understand why this is happening and develop methods to control it better,” Lindsey says.
