Your agent deleted something it shouldn't have at 2am. The alert fired. Now answer three questions: what did it do, why did it do it, and what did it touch. If you're grepping JSON for the next 30 minutes, you don't have tracing. You have logs with a worse UI.
This is the part nobody instruments until after the first incident. So let's instrument it before.
Heartbeat logging vs decision tracing
Most agent logging captures the heartbeat. Agent ran. Tool called. Response returned. Everything's HTTP 200 and everything's useless.
[02:14:07] agent.run status=200







