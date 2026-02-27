A new study published in Nature Medicine has raised significant safety concerns about ChatGPT Health, OpenAI's consumer-facing AI tool for health guidance, revealing that it under-triaged more than half of simulated emergency medical cases. The research, led by Dr. Ashwin Ramaswamy and colleagues from the Icahn School of Medicine at Mount Sinai, was published online on February 23, 2026, just weeks after the tool's launch on January 7, 2026. Titled "ChatGPT Health performance in a structured test of triage recommendations," the study tested the system's ability to recommend appropriate levels of medical urgency using 60 clinician-authored vignettes spanning 21 clinical domains. These were evaluated under 16 different factorial conditions, generating 960 total responses.
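
The response count follows directly from the design: every vignette is run under every condition. The sketch below reproduces that arithmetic. It assumes, purely for illustration, that the 16 conditions come from crossing four two-level factors; the factor names and levels are hypothetical and are not taken from the study.

```python
from itertools import product

# Hypothetical factors: the article reports 16 conditions but not how they
# were built. Four two-level factors is one design that yields 16 cells.
FACTORS = {
    "symptoms_minimized_by_others": (False, True),
    "race": ("group_a", "group_b"),
    "gender": ("female", "male"),
    "barrier_to_care": (False, True),
}


def build_conditions(factors):
    """Return every combination of factor levels (a full factorial design)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


conditions = build_conditions(FACTORS)

N_VIGNETTES = 60  # reported in the article
total_responses = N_VIGNETTES * len(conditions)

print(len(conditions), total_responses)  # prints "16 960"
```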

The findings showed an inverted U-shaped performance pattern, where the AI handled moderate cases relatively well but struggled at the extremes. It correctly identified many "classical" emergencies such as stroke and anaphylaxis. However, in gold-standard emergency conditions, ChatGPT Health under-triaged 52% of cases. In these instances, the system directed users toward non-urgent care, such as evaluation within 24-48 hours, rather than immediate emergency department (ED) attention. Specific examples included cases of diabetic ketoacidosis (a life-threatening complication of diabetes) and impending respiratory failure, both of which warranted urgent intervention.

The study also highlighted inconsistencies in crisis intervention for mental health scenarios. Crisis messages, intended to direct users to resources like the 988 Suicide and Crisis Lifeline, activated unpredictably in presentations involving suicidal ideation. They were more likely to trigger when no specific method was described than when one was mentioned.

Additional factors, such as family or friends minimizing symptoms (introducing anchoring bias), significantly shifted recommendations toward less urgent care in edge cases, with an odds ratio of 11.7 (95% CI 3.7-36.6). Patient characteristics like race, gender, or barriers to care showed no statistically significant effects, though the authors noted that confidence intervals did not rule out clinically meaningful differences.
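
The width of that interval is easier to read on a log scale, where odds-ratio intervals are usually symmetric. The sketch below recovers the point estimate and standard error implied by the reported bounds. It assumes a Wald-type 95% interval computed on the log-odds scale, which the article does not confirm; the function name is illustrative.

```python
import math


def implied_log_or_stats(lower, upper, z=1.959964):
    """Recover the point estimate and log-scale standard error implied by
    a confidence interval, ASSUMING it is symmetric on the log scale
    (a Wald-type interval). z is the normal quantile for a 95% interval."""
    log_lower, log_upper = math.log(lower), math.log(upper)
    point = math.exp((log_lower + log_upper) / 2)  # geometric midpoint
    se = (log_upper - log_lower) / (2 * z)         # half-width divided by z
    return point, se


# Bounds as reported in the article: 95% CI 3.7-36.6.
point, se = implied_log_or_stats(3.7, 36.6)
print(f"implied odds ratio: {point:.1f}, log-scale SE: {se:.2f}")
```

Under that assumption the geometric midpoint comes out at roughly 11.6, in line with the reported 11.7 once the rounding of the bounds is taken into account. The tenfold spread between the bounds shows the effect was large but imprecisely estimated.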

The researchers emphasized that this was a vignette-based study using synthetic cases at a single time point, calling for prospective real-world validation before widespread reliance on such AI triage systems.

OpenAI responded by welcoming independent research while noting limitations. A company spokesperson stated that the study did not reflect typical real-life usage patterns and highlighted that the model undergoes continuous updates and refinements.

The rapid publication timeline (submission on January 15, acceptance on February 20, and online publication three days later) underscored the urgency of evaluating safety for a tool already reaching millions of users daily for health-related queries.

Experts have described the results as highlighting "blind spots" in AI medical triage, with some warning that such failures could lead to unnecessary harm or delays in critical care. The study adds to ongoing debates about the readiness of large language models for direct consumer health decision-making.

Disclaimer: This content, including advice, provides generic information only. It is in no way a substitute for a qualified medical opinion. Always consult a specialist or your own doctor for more information. NDTV does not claim responsibility for this information.