The first time you let an AI agent control a desktop, it feels impressive.

Then it misses a button by 40 pixels.

Or it clicks the window behind the one it meant to click. Or it types into the wrong field because a notification stole focus. Or it spends ten seconds studying a screenshot just to guess where a text box probably is.

That was the part of desktop automation that bothered me. The model was not really failing at reasoning. It was being forced to reverse-engineer an application from pixels.

Screenshot-first is the wrong default