A study published June 12, 2026, in Nature Medicine found that general-purpose large language models consistently outperformed dedicated clinical AI products across standardized medical tasks. The general-purpose models were also preferred by the clinicians using them.

What the study actually tested

The researchers pitted three major general-purpose LLMs against purpose-built medical tools. On one side: OpenAI’s GPT-5.2, Google’s Gemini 3.1 Pro Preview, and Anthropic’s Claude Opus 4.6. On the other: dedicated clinical products like OpenEvidence and UpToDate Expert AI, tools specifically designed and marketed for healthcare professionals.

The battleground included MedQA questions, a well-established benchmark for evaluating medical knowledge drawn from medical licensing exams. The general-purpose models excelled across these tasks, beating the specialists on their home turf.

Google Search AI Overview was included as a control, representing the kind of quick-reference tool physicians actually reach for during a busy shift.