There is a recurring fantasy in interpretability work, somewhere between a wish and an embarrassment. You stare at a residual stream activation — twelve thousand floats — and you want to ask it, in plain English, what are you thinking about? Sparse autoencoders give you a thousand sparse latents you then label by inspecting top-activating examples. Attribution graphs give you sprawling diagrams a researcher spends an afternoon parsing. Probes give you a yes/no. All useful. None of them talk back.Anthropic’s new paper, Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations , is the first interpretability artifact in a while where the activation talks back. Literally. You point an NLA at a token in a Claude Opus 4.6 transcript and it produces a few bullet points of English describing what the model is thinking. That’s the deliverable. The paper is mostly an investigation of whether you should believe it.
The Sequence AI of the Week #859: Reading Claude’s Mind in English: A Note on Natural Language Autoencoders
Anthropic's fascinating new papers for the future of AI interpretability.












