Stop AI Hallucinations: How to Make Natural Language Testing Real with "Harness Engineering"
Abstract
When the system under test is a business-process-intensive software system (such as a configurable AI Agent platform), traditional automation testing hits a ceiling: API combinations explode, gaps in business understanding open up, and maintenance costs soar. This article documents our complete journey from hardcoded pytest wrappers to a CLI SDK and Skill system, which ultimately let us describe a test scenario in natural language and have it execute. Although the Agent system serves as our example, the approach applies to essentially any software product that exposes APIs, whether e-commerce, fintech, SaaS, or internal tools. The core insight is a shift from Context Engineering (feeding code to AI for test generation) to Harness Engineering (putting a "harness" on AI): fully leverage AI's decision-making capabilities while constraining how it understands the business system. This brings testing back to its original form: describing behavior and expectations in natural language.
1. Background: When Business System Complexity Exceeds Traditional Automation Testing
To make this concrete, consider a system we tested in depth: a user-facing Agent-building platform where users can:






