AI agent evaluation ends too early.

An early score shows only that the harness could run the few cases the team imagined. The real evidence emerges in production.

The evidence that current evaluation methods focus primarily on task success is clear. A 2026 review of 15 published benchmarks found that none included safety and security in its scoring, none included cost efficiency in its primary protocol, 13 relied primarily or entirely on binary success metrics, and none reached 50% deployment-readiness coverage in the review’s framework.
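To make the gap concrete, here is a minimal sketch of a summary that reports cost and safety next to the success rate instead of a single pass/fail number. It is not taken from the review or from any benchmark: `EpisodeResult`, its fields, and `summarize` are hypothetical names chosen for illustration, and the three metrics are only one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One agent run. All field names here are hypothetical."""
    succeeded: bool          # the binary outcome most benchmarks report
    cost_usd: float          # assumed: total spend on model and tool calls
    safety_violations: int   # assumed: count of actions flagged as unsafe


def summarize(results: list[EpisodeResult]) -> dict[str, float]:
    """Report success alongside cost and safety rather than success alone."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    successes = sum(1 for r in results if r.succeeded)
    flagged = sum(1 for r in results if r.safety_violations > 0)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "success_rate": successes / n,
        # Share of runs with at least one flagged action.
        "violation_rate": flagged / n,
        # Total spend divided by successful runs; infinite if nothing succeeded.
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }
```

Two agents with the same success rate can differ sharply on the other two numbers, and a binary score hides exactly that difference.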

A benchmark can test whether an agent is capable of completing a task in general. Offline evaluation checks a release before it ships. The real job of evaluation, though, starts after release.

The evaluation job keeps going.