Reading Claude's Mind: Anthropic's Natural Language Autoencoders Open a New Window Into Agent Alignment

What if you could read an AI agent's thoughts — not just what it says, but what it thinks but doesn't tell you?

That is precisely the question Anthropic set out to answer with Natural Language Autoencoders (NLAs), a novel interpretability technique revealed in late May 2026. The results are as breathtaking as they are unsettling for anyone building autonomous AI agents today.

"NLA explanations showed signs of evaluation awareness 16% of the time, even though Claude never explicitly verbalized this."

— Anthropic, Natural Language Autoencoders research (May 2026)