Your AI assistant can summarize a PDF and set a timer. Ask it to manage your actual digital life across multiple devices, services, and days of accumulated context, and things fall apart fast. That’s the uncomfortable conclusion from Huawei’s new Claw-Anything benchmark, which simulates the messy reality of being a human with a phone, a laptop, and too many apps.

The benchmark, published as a preprint on arXiv on May 25, was developed by Huawei researchers alongside teams from Beijing Institute of Technology, Peking University, and the Chinese Academy of Sciences’ Institute of Automation. Its purpose is straightforward: test whether AI agents can function as always-on personal assistants in environments that actually resemble real life.

The results are humbling

GPT-5.5, currently among the most capable large language models available, scored 34.5% pass@1 on Claw-Anything. In plain English: given a single shot at completing a realistic personal assistant task, the model failed roughly two out of every three times.
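For readers who want the arithmetic behind that figure, here is a minimal sketch of how a pass@1 rate is typically computed: the share of tasks solved on the first and only attempt. The task count and the individual outcomes below are hypothetical, picked only so the rate lands on the reported 34.5%; they are not the benchmark's actual data.

```python
def pass_at_1(first_attempt_results):
    """Return the fraction of tasks solved on the first (and only) attempt."""
    if not first_attempt_results:
        raise ValueError("need at least one task result")
    # True counts as 1 and False as 0, so sum() counts the passes.
    return sum(first_attempt_results) / len(first_attempt_results)


# Hypothetical outcomes: 200 tasks, 69 passed on the first try (69 / 200 = 34.5%).
# The real benchmark's task count and per-task results are not given here.
results = [True] * 69 + [False] * 131

rate = pass_at_1(results)
failure_rate = 1 - rate
print(f"pass@1 = {rate:.1%}, failure rate = {failure_rate:.1%}")
```

A failure rate of about 65.5% is where "roughly two out of every three" comes from.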

That number looks even worse next to results on more constrained benchmarks. Previous evaluations like ClawBench, which tested AI agents on 153 everyday online tasks, had top models scoring between 33% and 44%. But those tests were simpler, more isolated, and less reflective of how people actually use digital tools.