Stop Engineering Prompts: How an Eval-First Harness Let Us Ship 25 Algorithm Versions Autonomously

tl;dr: Agents are good at small fixes and terrible at "make this algorithm better," because a change can look good in isolation while silently regressing something elsewhere. We built an AI harness (immutable test set, multi-axis rubric, sweep tool, independent reviewer agent, human-viewable eval interface, knowledge-persistence layer) that lets an agent iterate on a real algorithm autonomously while a human still contributes intuition at the right layer. The result: 25 shipped versions of our color quantization pipeline in 13 days (git log doesn't lie), the most recent six of them in five days, right after the harness itself got an upgrade. The harness, not better prompts and not full automation, is what made that pace safe.

Two important caveats up front, because they change how the rest of this article reads:

This is spare-time work on a side project — evenings, weekends, the occasional lunch break. Not a full-time team, not a dedicated sprint.

The author is a product manager who cannot read a single line of code. Every line of the actual quantization pipeline was written by AI agents. The author's contribution is the harness, the hypotheses, and the human-eye review pass.

Twenty-five versions are possible at that intensity, from that author, because the harness does the watching. Once the loop is set up, each iteration is "open a terminal, propose a hypothesis in plain English, glance at the dashboard, approve or revert." If anything, those two caveats are the strongest argument for building the harness: it converts the small windows of time a non-engineer actually has into shipped algorithm progress that would normally require a full-time computer vision engineer.
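To make the "approve or revert" step concrete, here is a minimal sketch of a regression gate, assuming the harness stores one score per test image per rubric axis, with higher meaning better. The axis names, the `gate` function, the `tolerance` value, and the sample scores are all hypothetical illustrations and not taken from the actual project. The point it shows is the one from the tl;dr: a change that improves one axis can still be rejected because a single image got worse on another.

```python
from dataclasses import dataclass

# Hypothetical rubric axes; the real harness's axes are not listed in this article.
AXES = ("color_fidelity", "edge_preservation", "palette_size")


@dataclass(frozen=True)
class Verdict:
    approve: bool
    regressions: dict  # axis name -> list of test-image ids that got worse


def gate(baseline, candidate, tolerance=0.01):
    """Compare per-image, per-axis scores (higher is better) on one fixed test set."""
    # An immutable test set means both runs must cover exactly the same images.
    if baseline.keys() != candidate.keys():
        raise ValueError("test set changed between runs; scores are not comparable")
    regressions = {}
    for image_id, base_scores in baseline.items():
        for axis in AXES:
            # Flag any image whose score dropped by more than the tolerance.
            if candidate[image_id][axis] < base_scores[axis] - tolerance:
                regressions.setdefault(axis, []).append(image_id)
    return Verdict(approve=not regressions, regressions=regressions)


# Made-up scores: the candidate improves color fidelity everywhere,
# but hurts edge preservation on one image.
baseline = {
    "img_001": {"color_fidelity": 0.80, "edge_preservation": 0.70, "palette_size": 0.90},
    "img_002": {"color_fidelity": 0.75, "edge_preservation": 0.85, "palette_size": 0.90},
}
candidate = {
    "img_001": {"color_fidelity": 0.90, "edge_preservation": 0.72, "palette_size": 0.90},
    "img_002": {"color_fidelity": 0.85, "edge_preservation": 0.60, "palette_size": 0.90},
}

verdict = gate(baseline, candidate)
print("approve" if verdict.approve else "revert", verdict.regressions)
# prints: revert {'edge_preservation': ['img_002']}
```

Checking every image on every axis, rather than comparing averages, is a deliberate choice in this sketch: an average would have hidden the drop on `img_002` behind the color-fidelity gains.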
