Alignment Research

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Future AI systems will be even more powerful than today's, likely in ways that break key assumptions behind current safety techniques. That's why it's important to develop sophisticated safeguards to ensure models remain helpful, honest, and harmless. The Alignment team works to understand the challenges ahead and create protocols to train, evaluate, and monitor highly capable models safely.

Evaluation and oversight

Alignment researchers validate that models are harmless and honest even under circumstances very different from those under which they were trained. They also develop methods that allow humans to collaborate with language models to verify claims that humans might not be able to verify on their own.

Stress-testing safeguards

Alignment researchers also systematically look for situations in which models might behave badly, and check whether our existing safeguards are sufficient to deal with risks that human-level capabilities may bring.

May 8, 2026 · Alignment · Teaching Claude why
May 7, 2026 · Alignment · Donating our open-source alignment tool
Apr 14, 2026 · Alignment · Automated Alignment Researchers: Using large language models to scale scalable oversight
Feb 25, 2026 · Alignment · An update on our model deprecation commitments for Claude Opus 3
Feb 23, 2026 · Alignment · The persona selection model
Jan 29, 2026 · Alignment · How AI assistance impacts the formation of coding skills
Jan 28, 2026 · Alignment · Disempowerment patterns in real-world AI usage
Jan 9, 2026 · Alignment · Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
Dec 19, 2025 · Alignment · Introducing Bloom: an open source tool for automated behavioral evaluations
Nov 21, 2025 · Alignment · From shortcuts to sabotage: natural emergent misalignment from reward hacking