Aggregate Benchmarks Lie. Here's What 700 AI Functions Look Like by Security Domain.

Part 3 ranked 5 AI models by overall vulnerability rate. But when we broke the data down by security domain — database, auth, file I/O, command execution — the rankings inverted. The 'worst' model fixes 93% of database vulnerabilities. The 'best' model fails at remediation. Aggregate numbers hide domain expertise.

domenica 17 maggio 2026 New tab

Every AI benchmark I've seen makes the same mistake. They rank models by a single number — accuracy, pass rate, vulnerability rate — and call it a day.

In Part 3, we did exactly that. We ranked 5 models from Claude and Gemini by aggregate vulnerability rate and declared Haiku the safest (49%) and Gemini Pro the most dangerous (73%).

That ranking is real. It's also misleading.

TL;DR

When we broke 700 functions down by security domain, the rankings inverted. The model that "lost" the aggregate benchmark dominates the most important remediation category. The model that "won" has one of the lowest fix rates.

Aggregate Benchmarks Lie. Here's What 700 AI Functions Look Like by Security Domain.

Aggregate Benchmarks Lie. Here's What 700 AI Functions Look Like by Security Domain.

Other newsrooms on this story

Related reading

Security of 100 AI Agents Tested and Ranked – What You Need to Know

AI Still Can't Beat the On-Call Engineer: Here's Why - Decrypt

New benchmark exposes how badly AI struggles with real knowledge work

AI's Finance Problem Is Quantified — And That's Bullish for the Builders

AI model failure rates underestimated by 2.25x | VentureBeat

AI cybersecurity updates for MDASH, Mythos, and GPT-5.5.

Other newsrooms on this story

Related reading

Security of 100 AI Agents Tested and Ranked – What You Need to Know

AI Still Can't Beat the On-Call Engineer: Here's Why - Decrypt

New benchmark exposes how badly AI struggles with real knowledge work

AI's Finance Problem Is Quantified — And That's Bullish for the Builders

AI model failure rates underestimated by 2.25x | VentureBeat

AI cybersecurity updates for MDASH, Mythos, and GPT-5.5.