LLMs show a “highly unreliable” capacity to describe their own internal processes

WHY ARE WE ALL YELLING?!

Anthropic

Unfortunately for AI self-awareness boosters, this demonstrated ability was extremely inconsistent and brittle across repeated tests. The best-performing models in Anthropic’s tests—Opus 4 and 4.1—topped out at correctly identifying the injected concept just 20 percent of the time.

In a similar test where the model was asked “Are you experiencing anything unusual?”, Opus 4.1 improved to a 42 percent success rate, still short of even a bare majority of trials. The size of the “introspection” effect was also highly sensitive to which internal model layer the insertion was performed on: if the concept was introduced too early or too late in the multi-step inference process, the “self-awareness” effect disappeared completely.

Show us the mechanism
