Huawei unveils Claw-Anything benchmark, revealing AI agents' limitations in personal assistant tasks

Your AI assistant can summarize a PDF and set a timer. Ask it to manage your actual digital life across multiple devices, services, and days of accumulated context, and things fall apart fast. That’s the uncomfortable conclusion from Huawei’s new Claw-Anything benchmark, which simulates the messy reality of being a human with a phone, a laptop, and too many apps.

The benchmark, published as a preprint on arXiv on May 25, was developed by Huawei researchers alongside teams from Beijing Institute of Technology, Peking University, and the Chinese Academy of Sciences’ Institute of Automation. Its purpose is straightforward: test whether AI agents can function as always-on personal assistants in environments that actually resemble real life.

The results are humbling

GPT-5.5, currently among the most capable large language models available, scored a 34.5% pass@1 rate on Claw-Anything. In plain terms: when given one shot at completing a realistic personal assistant task, the model failed roughly two out of every three times.
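For readers who want the arithmetic behind pass@1, here is a minimal Python sketch. The function name `pass_at_1` and the 200-task example are assumptions chosen for illustration; the article does not say how many tasks Claw-Anything contains, and this is not the benchmark's own scoring code.

```python
from typing import Sequence


def pass_at_1(first_attempt_passed: Sequence[bool]) -> float:
    """Return the fraction of tasks solved on the first (and only) attempt."""
    if not first_attempt_passed:
        raise ValueError("need at least one task result")
    # True counts as 1 and False as 0, so sum() gives the number of passes.
    return sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical outcomes: 69 passes out of 200 tasks, picked only so the
# ratio matches the reported 34.5% figure.
outcomes = [True] * 69 + [False] * 131
score = pass_at_1(outcomes)
failure_rate = 1 - score
print(f"pass@1 = {score:.1%}, failure rate = {failure_rate:.1%}")
```

Under these assumed numbers the failure rate works out to about 65.5%, which is where the "two out of every three" framing comes from.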

That number sits at the low end of what top models achieve on more constrained benchmarks. Previous evaluations like ClawBench, which tested AI agents on 153 everyday online tasks, saw top models scoring between 33% and 44%, and those tests were simpler, more isolated, and less reflective of how people actually use digital tools.
