
A new study by Anthropic shows that language models can pick up hidden characteristics during distillation, a popular method for fine-tuning models for special tasks. While the traits transmitted this way can be benign, the research finds that the phenomenon, which the authors call "subliminal learning," can also lead to unwanted outcomes such as misalignment and harmful behavior.

What is subliminal learning?

Distillation is a common technique in AI application development. It involves training a smaller “student” model to mimic the outputs of a larger, more capable “teacher” model. This process is often used to create specialized models that are smaller, cheaper and faster for specific applications. However, the Anthropic study reveals a surprising property of this process.
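To make the distillation process described above concrete, here is a minimal sketch of its core training signal: the student is pushed to match the teacher's output distribution, typically via a KL-divergence loss on temperature-softened probabilities. This is a generic illustration, not Anthropic's experimental setup; the function names, temperature value, and logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    # Convert raw logits to probabilities, optionally softened by temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions:
    # minimizing this trains the student to mimic the teacher's outputs.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits over a 3-token vocabulary.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
loss = distillation_loss(teacher, student)
```

The loss is zero only when the student's distribution exactly matches the teacher's; in practice this objective (often mixed with a standard task loss) is minimized over the teacher's generated outputs, which is precisely the channel through which the study finds hidden traits can travel.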

The researchers found that teacher models can transmit behavioral traits to student models, even when the generated training data is completely unrelated to those traits.