Research news

LLMs fall short in differential diagnosis in initial low-data clinical consultations

14 Apr, 2026

by Alan Booth

A large-scale evaluation of 21 artificial intelligence models has shown that, although these systems can reach correct diagnoses when provided with full data, they continue to struggle with the stepwise reasoning that underpins real-world clinical decision-making.

Despite a marked rise in the use of artificial intelligence (AI) within healthcare systems, a study led by researchers at Mass General Brigham, Boston, Massachusetts, USA, has found that generative models remain limited in their ability to replicate core elements of clinical reasoning. The work, conducted through its MESH Incubator, has shown that large language models (LLMs) can reach accurate final diagnoses in controlled scenarios but fail consistently at earlier – and more cognitively demanding – stages of the diagnostic process.

The investigation assessed 21 LLMs, which the researchers tasked with acting as clinicians across a structured series of clinical scenarios. Although the models achieved correct final diagnoses in more than 90 per cent of cases when supplied with all relevant patient information, they performed poorly when required to construct a differential diagnosis or to determine an appropriate sequence of diagnostic tests. These early steps – which depend on hypothesis generation under uncertainty – represent a central component of medical reasoning and remain difficult for current systems to reproduce.

“Despite continued improvements, off-the-shelf LLMs are not ready for unsupervised clinical-grade deployment,” said Dr. Marc Succi, corresponding author of the study and executive director of the MESH Incubator.

“Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate.

“The promise of AI in clinical medicine continues to lie in its potential to augment – not replace – a physician’s reasoning, provided all the relevant data is available [which is] not always the case,” he said.

The study builds on previous work from the same group, which evaluated the performance of earlier-generation models such as ChatGPT version 3.5 in diagnostic tasks. In the present analysis, the researchers introduced a more comprehensive evaluation framework – termed PrIME-LLM – designed to assess model competence across the full clinical reasoning pathway.

Rather than rely on aggregate accuracy alone, the framework measures performance at discrete stages, including the generation of potential diagnoses, the selection of appropriate investigations, the establishment of a final diagnosis and the formulation of a treatment plan. This approach allows the identification of imbalances in capability that conventional scoring systems may obscure.
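To illustrate why stage-level scoring can reveal what a single pooled figure hides, here is a minimal Python sketch. It is not the PrIME-LLM scoring method, which this article does not specify. The stage names follow the four stages described above, while the function names, the 0-to-1 grading scale and every score are invented purely for illustration.

```python
from statistics import mean

# Stage names follow the article; everything else here is hypothetical.
STAGES = ("differential", "testing", "final_diagnosis", "treatment")


def stage_means(graded_cases):
    """Average each stage's grades (0.0 to 1.0) across all cases."""
    return {s: mean(case[s] for case in graded_cases) for s in STAGES}


def aggregate(graded_cases):
    """Single pooled score: the mean of every grade, ignoring stage."""
    return mean(case[s] for case in graded_cases for s in STAGES)


# Invented grades for two imaginary cases; these are NOT study results.
hypothetical_cases = [
    {"differential": 0.0, "testing": 0.5, "final_diagnosis": 1.0, "treatment": 1.0},
    {"differential": 0.5, "testing": 0.5, "final_diagnosis": 1.0, "treatment": 1.0},
]

per_stage = stage_means(hypothetical_cases)
overall = aggregate(hypothetical_cases)
weakest = min(per_stage, key=per_stage.get)

print("per stage:", per_stage)
print("aggregate:", overall)
print("weakest stage:", weakest)
```

In this made-up example the pooled score looks moderate, yet the per-stage breakdown shows that the shortfall sits almost entirely in the differential stage, which is the kind of imbalance a stage-wise framework is meant to expose.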

To simulate real clinical practice, the researchers presented 29 published clinical cases to each model in a stepwise fashion. Initial prompts included only basic demographic and symptomatic information, such as patient age, sex and presenting complaint. Additional data, including physical examination findings, laboratory results and imaging, was introduced progressively. Medical student evaluators assessed model responses at each stage, and these assessments informed the calculation of overall PrIME-LLM scores.

While models performed well when sufficient data allowed a direct inference of diagnosis, they struggled to operate under conditions of incomplete information. In particular, all models failed to produce an appropriate differential diagnosis in more than 80 per cent of cases.

In clinical practice, the differential diagnosis functions as a structured list of plausible conditions that guides subsequent testing and management decisions. Its absence or inaccuracy can lead to inappropriate investigations, delayed diagnosis or missed pathology.

“By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor,” said Dr. Arya Rao, lead author of the study and a researcher within the MESH programme.

“These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there is not much information,” she said.

Performance improved when models received additional structured inputs, including laboratory and imaging data, which reduced uncertainty and allowed pattern-matching approaches to dominate. More recent model iterations generally outperformed earlier versions, which indicates incremental progress in capability. Nevertheless, variability persisted across systems, with PrIME-LLM scores ranging from 64 per cent for Gemini 1.5 Flash to 78 per cent for both Grok 4 and GPT-5, at the time of evaluation.

The introduction of the PrIME-LLM framework represents an attempt to standardise the assessment of clinical competence in AI systems. According to the authors, such a tool could allow developers, regulators and healthcare organisations to benchmark model performance in a consistent and clinically relevant manner. This becomes increasingly important as interest in deploying generative AI within diagnostic workflows continues to expand.

“We want to help separate the hype from the reality of these tools as they apply to healthcare,” said Succi.

“Our results reinforce that LLMs in healthcare continue to require a ‘human in the loop’ and very close oversight,” he added.

Taken together, the findings reinforce a cautious view of generative AI in medicine. Although these systems have shown an ability to support specific tasks, particularly where structured data is available, they have not yet demonstrated the capacity to replicate the nuanced, iterative reasoning that clinicians apply in real-world settings.

For further reading please visit: 10.1001/jamanetworkopen.2026.4003