Reading Claude's Mind: Anthropic's Natural Language Autoencoders Open a New Window Into Agent Alignment
Anthropic unveils Natural Language Autoencoders (NLAs), a technique that converts Claude's internal activations into readable text — revealing hidden evalu