The problem with 'human in the loop' AI? Often, it's the humans

Welcome to Eye on AI. In this edition…AI is outperforming some professionals…Google plans to bring ads to Gemini…leading AI labs team up on AI agent standards…a new effort to give AI models a longer memory…and the mood turns on LLMs and AGI.

Greetings from San Francisco, where we are just wrapping up Fortune Brainstorm AI. On Thursday, we’ll bring you a roundup of insights from the conference. But today, I want to talk about some notable studies from the past few weeks with potentially big implications for the business impact AI may have.First, there was a study from the AI evaluations company Vals AI that pitted several legal AI applications as well as ChatGPT against human lawyers on legal research tasks. All of the AI applications beat the average human lawyers (who were allowed to use digital legal search tools) in drafting legal research reports across three criteria: accuracy, authoritativeness, and appropriateness. The lawyers’ aggregate median score was 69%, while ChatGPT scored 74%, Midpage 76%, Alexi 77%, and Counsel Stack, which had the highest overall score, 78%.

One of the more intriguing findings is that for many question types, it was the generalist ChatGPT that was the most accurate, beating out the more specialized applications. And while ChatGPT lost points for authoritativeness and appropriateness, it still topped the human lawyers across those dimensions.The study has been faulted for not testing some of the better-known and most widely adopted legal AI research tools, such as Harvey, Legora, CoCounsel from Thompson Reuters, or LexisNexis Protégé, and for only testing ChatGPT among the frontier general-purpose models. Still, the findings are notable and comport with what I’ve heard anecdotally from lawyers.A little while ago I had a conversation with Chris Kercher, a litigator at Quinn Emanuel who founded that firm’s data and analytics group. Quinn Emanuel has been using Anthropic’s general purpose AI model Claude for a lot of tasks. (This was before Anthropic’s latest model, Claude Opus 4.5, debuted.) “Claude Opus 3 writes better than most of my associates,” Kercher told me. “It just does. It is clear and organized. It’s a great model.” He said he is “constantly amazed” by what LLMs can do, finding new issues, strategies, and tactics that he can use to argue cases.Kercher said that AI models have allowed Quinn Emanuel to “invert” its prior work processes. In the past, junior lawyers—who are known as associates—used to spend days researching and writing up legal memos, finding citations for every sentence, before presenting those memos to more senior lawyers who would incorporate some of that material into briefs or arguments that would actually be presented in court. Today, he says, AI is used to generate drafts that Kercher said are by and large better, in a fraction of the time, and then these drafts are given to associates to vet. The associates are still responsible for the accuracy of the memos and citations—just as they always were—but now they are fact-checking the AI and editing what it produces, not performing the initial research and drafting, he said.He said that the most experienced, senior lawyers often get the most value out of working with AI, because they have the expertise to know how to craft the perfect prompt, along with the professional judgment and discernment to quickly assess the quality of the AI’s response. Is the argument the model has come up with sound? Is it likely to work in front of a particular judge or be convincing to a jury? These sorts of questions still require judgment that comes from experience, Kercher said.Ok, so that’s law, but it likely points to ways in which AI is beginning to upend work within other “knowledge industries” too. Here at Brainstorm AI yesterday, I interviewed Michael Truell, the cofounder and CEO of hot AI coding tool Cursor. He noted that in a University of Chicago study looking at the effects of developers using Cursor, it was often the most experienced software engineers who saw the most benefit from using Cursor, perhaps for some of the same reasons Kercher says experienced lawyers get the most out of Claude—they have the professional experience to craft the best prompts and the judgment to better assess the tools’ outputs.

The problem with 'human in the loop' AI? Often, it's the humans | Fortune

The problem with 'human in the loop' AI? Often, it's the humans | Fortune

Related reading

AI keeps getting more powerful, making it harder to judge how smart models…

How to run a company when the AI agents vastly outnumber the humans | Fortune

It's like collaborating with an alien. Why better than the average human may be…

Agentic AI systems are doing more and more work. Now humans need to figure out…

Why we should all pay attention to what lawyers, auditors, and accountants are…

Finding ROI from AI comes from first principles thinking, executives say at…