ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks. ITBench-AA is built in partnership with IBM Research and is based on their ITBench benchmark. The series starts with Site Reliability Engineering tasks, where frontier models score below 50%.

ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where models and agents must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by IBM Research, leveraging IBM’s deep expertise in enterprise IT operations.

Artificial Analysis has worked closely with IBM over the last six months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) tasks, with Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks to follow over time.
