Most of the work that makes an AI agent good never happens inside the model. It happens in the harness: the code that feeds the model its observations, defines its tools, trims its context, and decides what to do with each response. When an agent fails, the usual fix is for a human to edit that harness, for example by rewording a tool description, adding a memory store, or changing how a screenshot gets summarized. The Continual Harness work, from the teams behind Gemini Plays Pokémon and the PokeAgent benchmark, pushes on a sharper question: what if the model edited the harness itself while the run was still going?
The harness is where agents actually live
Gemini Plays Pokémon was a public demonstration: a Gemini model worked through a Game Boy Pokémon title via a harness that turned the game into something a language model could reason about. The harness converted pixels into labeled screenshots, a map of the current area, and an inventory list, then exposed button presses and pathfinding helpers as tools. The model never touched raw emulator memory. It saw whatever the harness chose to show it, and it acted only through the tools the harness defined.
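To make that structure concrete, here is a minimal Python sketch of a harness loop of this kind. It is not the Gemini Plays Pokémon code. Every name in it (`Observation`, `Harness`, `press_button`, `navigate_to`, `scripted_model`) is hypothetical, the observation is assumed to be already extracted from the emulator, and `scripted_model` stands in for a real model call. The point it illustrates is the one above: the model receives only the text the harness renders, and it can act only through tools the harness has registered.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Observation:
    # Hypothetical structured state the harness has already extracted from the game.
    screen_labels: list[str]
    area_map: list[str]
    inventory: dict[str, int]


@dataclass
class Harness:
    tools: dict[str, Callable[..., str]] = field(default_factory=dict)
    log: list[str] = field(default_factory=list)

    def tool(self, fn: Callable[..., str]) -> Callable[..., str]:
        # Register a function as a tool the model is allowed to call.
        self.tools[fn.__name__] = fn
        return fn

    def render(self, obs: Observation) -> str:
        # The model sees only this text, never raw emulator memory.
        lines = ["SCREEN: " + ", ".join(obs.screen_labels), "MAP:"]
        lines += obs.area_map
        items = ", ".join(f"{name} x{n}" for name, n in sorted(obs.inventory.items()))
        lines.append("INVENTORY: " + (items or "(empty)"))
        lines.append("TOOLS: " + ", ".join(sorted(self.tools)))
        return "\n".join(lines)

    def step(self, obs: Observation, model: Callable[[str], tuple[str, dict]]) -> str:
        # One turn: render the observation, ask the model, dispatch its chosen tool.
        name, args = model(self.render(obs))
        if name not in self.tools:
            # Anything outside the registered tools is rejected by the harness.
            result = f"error: unknown tool {name!r}"
        else:
            result = self.tools[name](**args)
        self.log.append(result)
        return result


harness = Harness()
VALID_BUTTONS = {"A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"}


@harness.tool
def press_button(button: str) -> str:
    if button not in VALID_BUTTONS:
        return f"error: invalid button {button!r}"
    return f"pressed {button}"


@harness.tool
def navigate_to(x: int, y: int) -> str:
    # Stand-in for a pathfinding helper; a real one would plan a route on the map.
    return f"walking to ({x}, {y})"


def scripted_model(prompt: str) -> tuple[str, dict]:
    # Stand-in for the language model: picks an action from the rendered text.
    if "dialog box" in prompt:
        return "press_button", {"button": "A"}
    return "navigate_to", {"x": 3, "y": 4}


if __name__ == "__main__":
    obs = Observation(["dialog box open"], ["#####", "#P..#"], {"Potion": 2})
    print(harness.render(obs))
    print(harness.step(obs, scripted_model))
```

Because the tool registry and `render` are ordinary code and data, they are exactly the parts a human edits when an agent like this fails: adding a tool, changing what the map rendering includes, or rewording how the inventory is described.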
That structure is not specific to Pokémon. A coding agent doesn't see your repository — it sees the files a retrieval step pulled in. A browser agent doesn't see a webpage — it sees an accessibility tree some extraction code produced. The harness is the agent's entire sensory system, its motor system, and its memory. The model is one component inside it.









