The "Black Box" Problem of Agent Benchmarks
The Experiment: Diagnosing ITBench Agents
Finding 1: Stronger models like Gemini-3-Flash show surgical (isolated) failure modes per trace, whereas open-source Kimi-K2 and GPT-oss-120b show compounding failure patterns
Finding 2: "Non-Fatal" vs. "Fatal" Failures
The "Non-Fatal" (Benign) Flaws
The "Fatal" Flaws
Case Study: Gemini-3-Flash (Decisive but Overconfident)
Case Study: GPT-OSS-120B
A different (and more useful) way to read the plots: “fatal” vs “non-fatal”
Recoverable / structural (show up even in successful traces)
Fatal / decisive (strongly associated with failed traces)
Conclusion

Ayhan Sebin
Saurabh Jha
Rohan Arora
Daby Sow






