TL;DR — I built an MCP server that lets Claude drive Windows desktop apps (UI Automation + screen capture + OCR + Visual Studio debugger). In a black-box test of Windows Calculator, the agent falsely flagged a "display bug" based on the AccessibleName string. When I told it to look again, it captured the real screen, compared its earlier claim against the pixels, and retracted its own finding. The retraction was the moment I knew this was worth shipping.
The moment
The agent was 30 minutes into autonomously poking at Windows Calculator. It had verified divide-by-zero, the old √4 − 2 precision bug, 0^0, overflow, error recovery. All green.
Then it stopped. New message:
"Sensei, the history shows √(4) マイナス ( 2= 0. There's an unclosed opening paren ( right after the minus sign. The result 0 is correct, but the expression display looks wrong. Possible display bug."






