Here is a small failure mode that cost me time for longer than it should have.

The agent would run the test suite. Tests would fail. The agent would announce that tests failed. Then, when it needed to know which tests failed, would either guess, ask, or run the suite a second time to scrape the output. Sometimes it would run it a third time. The failing tests were right there in the output of the first run; the agent didn't read them carefully enough to remember.

This is not the model being lazy. It is the model being honest about its attention. Test output is voluminous, the failure summary is buried at the bottom, and by the time the agent is reasoning about the next step ("which test failed, what assertion, what file"), the relevant lines have scrolled past whatever the agent actually retained. The cheapest way to get the information back is to run the command again.

Running the command again is the wrong answer. A full suite takes minutes. Doing it twice to learn something the first run already told you is a tax on every red-test loop, and the loops compound.

The fix: tee everything to a gitignored directory