Top Frontier AI Models Top Out At C+ ... Barely Better Than Old Models

Top frontier AI models aren't that top. In fact, according to a new study, they max out around the C+ level.

Top new frontier AI models from OpenAI and Anthropic are more expensive, and they come with gaudy new claims of higher intelligence and superior results. But according to a new study of 510 questions by expert network Pearl, they don’t actually improve performance all that much. In fact, they're all clustering just below the level where professionals would actually trust them.

Pearl tested 25 of the world’s leading AI models, including GPT-5.5, Claude Opus 4.7 and Gemini, with real licensed professionals judging the answers. The result: none of the models exceeded 73%, which is probably a C grade, maybe a C+.

GPT-5.5 was tops at 72.7%, with 5.1 at 72.0%
Claude Opus 4.7 scored 71.9%, with 4.6 at 69.8%
Gemini 3 Pro hit 67.3%, with 2.5 Pro at 64.5%

"Benchmarks measure whether a model can pass a test. We’re asking whether a professional would trust the answer, and right now, the answer is no," said Pearl CEO Andy Kurtzig. "Almost right is still wrong."

Pearl assembled roughly 510 questions across five professional domains: business, health, law, pets and technology. None had ever been released publicly, and none were available to model developers during training. Each of the 25 AI models received identical prompts with no tuning or prompt engineering, and responses were graded by credentialed experts on a 1-to-5 rubric measuring four dimensions: correctness, completeness, prioritization, and professional judgment.

That last criterion is where Pearl is making its sharpest claim: that getting the right answer isn't enough if a model can't weigh what matters, flag what's urgent, or recognize when a question requires escalation rather than an answer.
Pearl also tested models in both minimum and maximum reasoning configurations, and says the results showed that more inference-time compute delivers only a 1-2.6% improvement, and occasionally produces worse answers.

That’s not impressive.

Some areas were better, of course. Top models hit 80.9% in business, for instance. But in law and health, Pearl says some widely used models dropped to around 20% expert alignment: unimpressive at best, dangerous at worst.

Of course, there’s a big caveat to mention here.

Pearl is an expert network (of humans) that builds AI systems with experts in the loop. In other words, Pearl is not a neutral academic outfit. That doesn’t make the data wrong, of course. But it’s worth keeping in mind. The other caveat is that 70% might be OK for some businesses that then expect their human staff to pick up where their AI agents have left off.

But for those executives at companies like Cisco and Meta that are shedding human workers to align with the age of AI, the results should remind them that AI makes more than a few mistakes in every domain, and makes serious errors in specific high-impact areas like health and law.

So maybe we can’t let go of all the humans just yet.
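
For readers who want to check the "barely better than old models" claim, the gap between each newer model and its predecessor can be worked out from the six scores quoted above. This is a rough sketch that uses only the percentages reported in this article. The dictionary layout and the function name are illustrative assumptions, not anything from Pearl's study.

```python
# Scores (percent) as quoted in the article, grouped by model family.
# Each entry maps a family to its (newer, older) version labels and scores.
SCORES = {
    "GPT": {"newer": ("5.5", 72.7), "older": ("5.1", 72.0)},
    "Claude Opus": {"newer": ("4.7", 71.9), "older": ("4.6", 69.8)},
    "Gemini Pro": {"newer": ("3", 67.3), "older": ("2.5", 64.5)},
}


def generation_gains(scores):
    """Return the newer-minus-older gap in percentage points per family."""
    return {
        family: round(pair["newer"][1] - pair["older"][1], 1)
        for family, pair in scores.items()
    }


gains = generation_gains(SCORES)
for family, gain in gains.items():
    print(f"{family}: +{gain} percentage points")
```

On these figures, the gain from one generation to the next ranges from 0.7 to 2.8 percentage points.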
