Ask five of the world’s most advanced AI models whether something is true, and two-thirds of the time, at least one of them will disagree with the group. That’s the headline finding from a new study by Lenz Research, which tested GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, and Sonar Pro on 1,000 real-world claims submitted by actual users to a fact-checking platform.

The results are sobering. Out of those 1,000 claims, 672, or 67%, produced at least one model that dissented from the panel majority. In plain English: if you’re treating any single AI model as your personal oracle of truth, you’re rolling the dice more often than you think.

The numbers behind the disagreement

Lenz Research didn’t just measure whether models agreed or disagreed in a binary sense. They looked at the depth of disagreement, too. A full 343 claims, roughly 34%, showed what the researchers call “substantive disagreements,” where the most-disagreeing pair of models landed two or more verdict categories apart on a scale that ranged from True to Mostly True to Misleading to False.
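To make the two measures concrete, here is a minimal sketch of how a single claim could be scored. It assumes the scale has exactly the four categories named above, that "dissent" means differing from the most common verdict on the panel, and that the gap between the most-disagreeing pair is the distance between the highest and lowest verdict. The model names and function names are hypothetical, and the study's actual procedure may differ.

```python
from collections import Counter

# Verdict scale as listed in the article, ordered from most to least true.
# Assumption: these four categories are the whole scale.
SCALE = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(SCALE)}


def dissenters(verdicts):
    """Return the models whose verdict differs from the most common verdict."""
    top_verdict, _ = Counter(verdicts.values()).most_common(1)[0]
    return [model for model, v in verdicts.items() if v != top_verdict]


def max_spread(verdicts):
    """Distance, in verdict categories, between the most-disagreeing pair."""
    ranks = [RANK[v] for v in verdicts.values()]
    return max(ranks) - min(ranks)


def is_substantive(verdicts):
    """True when some pair of models lands two or more categories apart."""
    return max_spread(verdicts) >= 2


# Hypothetical panel output for one claim (placeholder model names).
example = {
    "model_a": "True",
    "model_b": "True",
    "model_c": "Mostly True",
    "model_d": "True",
    "model_e": "Misleading",
}

print(dissenters(example))      # ['model_c', 'model_e']
print(max_spread(example))      # 2
print(is_substantive(example))  # True
```

Under these assumptions, a claim counts toward the 67% figure whenever `dissenters` returns a non-empty list, and toward the 34% figure whenever `is_substantive` returns `True`.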

To quantify the overall level of agreement, the study used Krippendorff’s alpha, a standard statistical measure of inter-rater reliability. The score came in at 0.639 on an ordinal scale. For context, a score of 1.0 means perfect agreement and 0 means agreement no better than chance. Under Krippendorff’s own widely cited rule of thumb, 0.800 or above counts as reliable, and scores between 0.667 and 0.800 support only tentative conclusions. The models, in other words, landed just below even that lower bar.
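For readers curious about what goes into that number, the following is a rough sketch of the textbook ordinal Krippendorff's alpha, computed from a coincidence matrix. It is not the study's code: the function name and input layout are assumptions made for illustration, and verdicts are assumed to be encoded as integer ranks (0 for the first category on the scale, and so on).

```python
def krippendorff_alpha_ordinal(units, n_categories):
    """Ordinal Krippendorff's alpha.

    units: one list of integer ranks (0 .. n_categories - 1) per claim,
           holding the verdicts of the models that rated that claim.
    """
    K = n_categories

    # Coincidence matrix: each ordered pair of ratings within a claim
    # contributes 1 / (m - 1), where m is the number of ratings on that claim.
    o = [[0.0] * K for _ in range(K)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a claim with a single rating carries no pair information
        counts = [0] * K
        for r in ratings:
            counts[r] += 1
        for c in range(K):
            for k in range(K):
                if c == k:
                    pairs = counts[c] * (counts[c] - 1)
                else:
                    pairs = counts[c] * counts[k]
                o[c][k] += pairs / (m - 1)

    n_c = [sum(row) for row in o]  # marginal total per category
    n = sum(n_c)                   # total number of pairable ratings

    def delta2(c, k):
        # Ordinal distance: squared count of ratings lying between c and k.
        lo, hi = min(c, k), max(c, k)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2) ** 2

    observed = sum(o[c][k] * delta2(c, k) for c in range(K) for k in range(K))
    expected = sum(n_c[c] * n_c[k] * delta2(c, k)
                   for c in range(K) for k in range(K))
    if expected == 0:
        raise ValueError("alpha is undefined when every rating is identical")
    return 1 - (n - 1) * observed / expected


# Two toy panels on a four-category scale (made-up data, not from the study).
perfect = [[0, 0, 0], [3, 3, 3]]       # models always agree
mixed = [[0, 0, 1], [3, 3, 2], [1, 3, 0]]
print(krippendorff_alpha_ordinal(perfect, 4))  # 1.0
print(krippendorff_alpha_ordinal(mixed, 4))
```

The ordinal distance is what makes a True versus False split count for more than a True versus Mostly True split, which is why the choice of an ordinal scale matters for the reported score.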