The first time you let an AI agent control a desktop, it feels impressive.
Then it misses a button by 40 pixels.
Or it clicks the window behind the one it was aiming for. Or it types into the wrong field because a notification stole focus. Or it spends ten seconds studying a screenshot just to decide where a textbox probably is.
That was the part of desktop automation that bothered me. The model was not really failing at reasoning. It was being forced to reverse-engineer an application from pixels.
Screenshot-first is the wrong default






