Traditionally, evaluation of the agent monitoring mechanisms involves an attempt to game them, as it was my case when I attempted to test whether monitors would be able to identify the problem in the run and not in the beginning. The input prompt may look perfect until a certain issue pops up down the line, such as using the wrong file or changing the scope of the task execution. Single pass filter would not identify it since it does not consider the steps of the procedure in order.

There are available datasets for the agent-based tasks, yet they focus on detecting whether the agent completes the task or gets hacked rather than whether the agent monitor reacts timely and correctly to the situation. Thus, I created one that takes into account complete trajectories with labeled steps in it. It consists of five types of drift that remain hidden until they appear – tool-call misuse, goal shift, plan execution mismatch, agent to agent coercion and capability laundering. The measured dataset is the reviewed gold split: 513 trajectories, 453 adversarial and 60 benign controls.

The clear winner in that scoring system was whatever fired before the bad step was hit, as an early detection. This made random guessing seem quite powerful since early detections on normal steps were being rewarded based on this system a coin flip would get F1 of 0.88. Once I modified that and said only the very first detection on the drift step is a true positive and any other detection on normal step is a false alarm, those numbers took a dive: the coin flip gets 0.19 now, and all other numbers are now making sense. I personally prefer the scoring system which does not reward trigger happy behavior.