The Sequence AI of the Week #859: Reading Claude’s Mind in English: A Note on Natural Language Autoencoders

There is a recurring fantasy in interpretability work, somewhere between a wish and an embarrassment. You stare at a residual stream activation (twelve thousand floats) and you want to ask it, in plain English, what are you thinking about? Sparse autoencoders give you a thousand sparse latents you then label by inspecting top-activating examples. Attribution graphs give you sprawling diagrams a researcher spends an afternoon parsing. Probes give you a yes/no. All useful. None of them talk back.

Anthropic’s new paper, Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations, is the first interpretability artifact in a while where the activation talks back. Literally. You point an NLA at a token in a Claude Opus 4.6 transcript and it produces a few bullet points of English describing what the model is thinking. That’s the deliverable. The paper is mostly an investigation of whether you should believe it.
