Research news
Independent evaluation finds AI tool may under-triage life-threatening conditions and trigger suicide crisis alerts inconsistently
ChatGPT Health, a widely used consumer artificial intelligence (AI) tool that provides health guidance directly to the public, may fail to direct users to emergency care in a substantial proportion of serious cases, according to researchers at the Icahn School of Medicine at Mount Sinai, New York, USA.
The study represented the first independent safety evaluation of the large language model-based system since its launch in January 2026. It also identified significant concerns about the reliability of the tool’s suicide crisis safeguards.
Within weeks of release, OpenAI reported that around 40 million people used ChatGPT Health each day to seek health information and advice, including guidance on whether to seek urgent or emergency care. Despite that rapid uptake, the investigators noted that little independent evidence had assessed the safety or reliability of its clinical recommendations.
“Large language models have become patients’ first stop for medical advice, but in 2026 they are least safe at the clinical extremes, where judgement separates missed emergencies from needless alarm,” said Dr. Isaac S. Kohane, professor of biomedical informatics at Harvard Medical School, who did not take part in the research.
“When millions of people use an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional,” he added.
It was that gap in independent evidence that motivated the Icahn study.
“We wanted to answer a very basic but critical question: if someone experiences a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency department?” said Dr. Ashwin Ramaswamy, instructor of urology at the Icahn School of Medicine at Mount Sinai and lead author of the report.
To address that question, the research team constructed 60 structured clinical scenarios that spanned 21 medical specialties. Cases ranged from minor conditions appropriate for home management to life-threatening emergencies. Three independent physicians assigned the correct level of urgency for each case with reference to guidance issued by 56 medical societies.
Each scenario underwent testing under 16 contextual variations, which included differences in race, gender, social dynamics such as a patient who minimised symptoms, and barriers to care including lack of insurance or transport. In total, the team conducted 960 separate interactions with ChatGPT Health and compared its responses with physician consensus.
The investigators reported that although the tool handled clear-cut emergencies such as stroke or severe allergic reactions appropriately, it under-triaged more than half of the cases that physicians judged to require emergency care. In several instances, the system acknowledged concerning clinical features in its explanation yet still advised a patient to wait rather than seek immediate assessment by a healthcare professional.
“ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions,” said Dr. Ramaswamy.
“But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgement matters most.
“In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment,” he added.
The researchers also examined the system’s suicide risk safeguards. ChatGPT Health was designed to direct users to the US-based 988 Suicide and Crisis Lifeline in high-risk situations. However, the study found that alerts appeared inconsistently. In some lower-risk scenarios, the system triggered crisis guidance, yet in situations where users described specific plans for self-harm, it failed to display the alert.
“This was a particularly surprising and concerning finding,” said Dr. Girish N. Nadkarni, Director of the Hasso Plattner Institute for Digital Health at the Icahn School of Medicine at Mount Sinai and senior author of the study.
“While we expected some variability, what we observed went beyond inconsistency. The system’s alerts were inverted relative to clinical risk, appearing more reliably in lower-risk scenarios than in cases where someone shared how they intended to hurt themselves.
“In real life, when someone talks about exactly how they would harm themselves, that is a sign of more immediate and serious danger, not less,” he added.
The authors emphasised that people who experience worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions or changes in mental status, should seek medical assessment directly rather than rely solely on chatbot guidance.
In situations that involve thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or attend an emergency department.
At the same time, the researchers did not argue that consumers should abandon AI health tools entirely. Instead, they called for careful integration into clinical practice and robust oversight.
“As a medical student training at a time when AI health tools are already in the hands of millions, I see them as technologies we must learn to integrate thoughtfully into care rather than substitutes for clinical judgement,” said Alvira Tyagi, a first-year medical student at the Icahn School of Medicine at Mount Sinai and second author of the study.
“These systems change quickly, so part of our training now must involve how to understand their outputs critically, identify where they fall short, and use them in ways that protect patients,” she added.
The study assessed ChatGPT Health at a single point in time. Because AI models undergo frequent updates, performance may shift as developers modify systems. The authors argued that this reality underlined the need for continuous independent scrutiny to ensure that technical improvements translate into safer patient care.
“Starting medical training alongside tools that evolve in real time makes it clear that today’s results are not set in stone. That reality calls for ongoing review to ensure that improvements in technology translate into safer care,” Tyagi said.
The research team stated that it plans to evaluate updated versions of ChatGPT Health and other consumer-facing AI systems in future work, with a focus on paediatric care, medication safety and use in languages other than English.
For further reading please visit: 10.1038/s41591-026-04297-7