OpenAI researchers show small doses of "beneficial trait" training make AI models broadly safer and harder to manipulate

OpenAI researchers show that reinforcement learning on desired behavioral traits like truthfulness and corrigibility works across domains. Training on health data also improved deception detection, and the model scored better on 44 out of 53 benchmarks. The approach differs from Anthropic's constitution-based method.


TL;DR

OpenAI trained models on beneficial behavioral traits via RL, improving scores on 44 of 53 safety benchmarks, including deception and reward hacking, with gains generalizing to unfamiliar domains. The trained models selectively resisted harmful steering without losing flexibility, offering an empirical governance path for production AI safety.


Jun 19, 2026

Nano Banana Pro prompted by THE DECODER

Reinforcement learning on realistic scenarios with desired behavioral traits is supposed to make AI models safer and more helpful across domains. The approach is fundamentally different from Anthropic's constitutional method.

When AI models are trained on problematic behavior in one domain, that misalignment can spread to other areas. OpenAI researchers have now tested whether the reverse also works: Can good behavior generalize just as broadly?
