A Stanford and Tsinghua paper ran a controlled experiment earlier this year. Same model. Same task. Different harness architecture.
The result: a 6x performance gap driven entirely by the system built around the model. Not the model itself.
This is not a prompt engineering insight. It is a systems architecture insight, and it changes where developers should invest their time when building agentic systems.
The 6x Gap
Meta-Harness tested Claude Opus 4.6 across two harness configurations on TerminalBench-2. The only variable was the scaffold: the code that manages tool calls, context windows, error recovery, and state persistence.








