The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and positive outcomes.

Safety through understanding

It's very challenging to reason about the safety of neural networks without understanding them. The Interpretability team’s goal is to be able to explain large language models’ behaviors in detail, and then use that to solve a variety of problems ranging from bias to misuse to autonomous harmful behavior.

Multidisciplinary approach

Some Interpretability researchers have deep backgrounds in machine learning – one member of the team is often described as having started mechanistic interpretability, while another was on the famous scaling laws paper. Other members joined after careers in astronomy, physics, mathematics, biology, data visualization, and more.

May 7, 2026 | Interpretability | Natural Language Autoencoders: Turning Claude’s thoughts into text
Apr 2, 2026 | Interpretability | Emotion concepts and their function in a large language model
Mar 13, 2026 | Interpretability | A “diff” tool for AI: Finding behavioral differences in new models
Jan 19, 2026 | Interpretability | The assistant axis: situating and stabilizing the character of large language models
Oct 29, 2025 | Interpretability | Signs of introspection in large language models
Aug 1, 2025 | Interpretability | Persona vectors: Monitoring and controlling character traits in language models
May 29, 2025 | Interpretability | Open-sourcing circuit tracing tools
Mar 27, 2025 | Interpretability | Tracing the thoughts of a large language model
Mar 13, 2025 | Alignment | Auditing language models for hidden objectives
Feb 20, 2025 | Interpretability | Insights on Crosscoder Model Diffing
Interpretability Research
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.












