
A new study by Anthropic shows that language models can pick up hidden characteristics during distillation, a popular method for fine-tuning models for special tasks. While the traits transmitted this way can be benign, the research finds that the phenomenon, which the authors call "subliminal learning," can also lead to unwanted outcomes such as misalignment and harmful behavior.

What is subliminal learning?

Distillation is a common technique in AI application development. It involves training a smaller “student” model to mimic the outputs of a larger, more capable “teacher” model. This process is often used to create specialized models that are smaller, cheaper and faster for specific applications. However, the Anthropic study reveals a surprising property of this process.
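To make the distillation process described above concrete, here is a minimal sketch of its core training signal: the student is pushed to match the teacher's output distribution, typically via a KL-divergence loss on temperature-softened probabilities. This is a generic illustration, not Anthropic's experimental setup; the function names, temperature value, and logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    # Convert raw logits to probabilities, optionally softened by temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions:
    # minimizing this trains the student to mimic the teacher's outputs.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits over a 3-token vocabulary.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
loss = distillation_loss(teacher, student)
```

The loss is zero only when the student's distribution exactly matches the teacher's; in practice this objective (often mixed with a standard task loss) is minimized over the teacher's generated outputs, which is precisely the channel through which the study finds hidden traits can travel.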

The researchers found that teacher models can transmit behavioral traits to student models, even when the generated training data is completely unrelated to those traits.