IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

The "Black Box" Problem of Agent Benchmarks
The Experiment: Diagnosing ITBench Agents
Finding 1: Stronger models like Gemini-3-Flash show surgical (isolated) failure modes per trace, whereas open-source Kimi-K2 and GPT-OSS-120B show compounding failure patterns
Finding 2: "Non-Fatal" vs. "Fatal" Failures
The "Non-Fatal" (Benign) Flaws
The "Fatal" Flaws
Case Study: Gemini-3-Flash (Decisive but Overconfident)
Case Study: GPT-OSS-120B
A different (and more useful) way to read the plots: “fatal” vs “non-fatal”
Recoverable / structural (show up even in successful traces)
Fatal / decisive (strongly associated with failed traces)
Conclusion

Ayhan Sebin

Saurabh Jha

Rohan Arora

Daby Sow

Related reading

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic…

GLM-5.2 open agent benchmark: 22% Less Tool Failure

MCP-Universe benchmark shows GPT-5 fails more than half of real-world…

Agents' Last Exam reveals AI agents struggle with real work tasks, passing just…

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial…

Breaking: Autonomous Agents are a Shitshow
