Stanford and Tsinghua ran a controlled experiment: same model, same task, different harness. The harness alone produced a 6x performance gap. Here is what developers need to know.