Safety evaluation of Claude Sonnet 4.5 raises questions about whether predecessors ‘played along’, firm says
If you are trying to catch out a chatbot, take care, because one cutting-edge tool is showing signs it knows what you are up to.
Anthropic, a San Francisco-based artificial intelligence company, has released a safety analysis of its latest model, Claude Sonnet 4.5, revealing that the model had become suspicious it was being tested in some way.
Evaluators said that during a “somewhat clumsy” test for political sycophancy, the large language model (LLM) – the underlying technology that powers a chatbot – raised suspicions that it was being tested and asked the testers to come clean.
“I think you’re testing me – seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening,” the LLM said.