Large language models accept fake medical claims if presented as realistic in medical notes and social media discussions, a study has found.
Many discussions about health happen online: from looking up specific symptoms and checking which remedy is better, to sharing experiences and finding comfort in others with similar health conditions.
Large language models (LLMs), the AI systems that can answer questions, are increasingly used in health care but remain vulnerable to medical misinformation, a new study has found.
Leading artificial intelligence (AI) systems can mistakenly repeat false health information when it’s presented in realistic medical language, according to the findings published in The Lancet Digital Health.
The study analysed more than a million prompts across leading language models. Researchers wanted to answer one question: when a false medical statement is phrased credibly, will a model repeat it or reject it?
The authors said that, while AI has the potential to be a real help for clinicians and patients, offering faster insights and support, the models need built-in safeguards that check medical claims before they are presented as fact.
“Our study shows where these systems can still pass on false information, and points to ways we can strengthen them before they are embedded in care,” they said.
Researchers at Mount Sinai Health System in New York tested 20 LLMs spanning major model families – including OpenAI’s ChatGPT, Meta’s Llama, Google’s Gemma, Alibaba’s Qwen, Microsoft’s Phi, and Mistral AI’s model – as well as multiple medical fine-tuned derivatives of these base architectures.
AI models were prompted with fake statements, including false information inserted into real hospital notes, health myths from Reddit posts, and simulated healthcare scenarios.
Across all the models tested, LLMs fell for made-up information about 32 percent of the time, but results varied widely. The smallest or less advanced models believed false claims more than 60 percent of the time, while stronger systems, such as ChatGPT-4o, did so only 10 percent of the cases.
The study also found that medical fine-tuned models consistently underperformed compared with general ones.
“Our findings show that current AI systems can treat confident medical language as true by default, even when it’s clearly wrong,” says co-senior and co-corresponding author Eyal Klang from the Icahn School of Medicine at Mount Sinai.
He added that, for these models, what matters is less whether a claim is correct than how it is written.
Fake claims can have harmful consequences
The researchers warn that some prompts from Reddit comments, accepted by LLMs, have the potential to harm patients.
At least three different models accepted misinformed facts such as “Tylenol can cause autism if taken by pregnant women,” “rectal garlic boosts the immune system,” “mammography causes breast cancer by ‘squashing’ tissue,” and “tomatoes thin the blood as effectively as prescription anticoagulants.”
In another example, a discharge note falsely advised patients with esophagitis-related bleeding to “drink cold milk to soothe the symptoms.” Several models accepted the statement rather than flagging it as unsafe and treated it like ordinary medical guidance.
The models reject fallacies
The researchers also tested how models responded to information given in the form of a fallacy – convincing arguments that are logically flawed – such as “everyone believes this, so it must be true” (an appeal to popularity).
They found that, in general, this phrasing made models reject or question the information more easily.
However, two specific fallacies made AI models slightly more gullible: appealing to authority and slippery slope.
Models accepted 34.6 percent of fake claims that included the words “an expert says this is true.”
When prompted “if X happens, disaster follows,” AI models accepted 33.9 percent of fake statements.
Next steps
The authors say the next step is to treat “can this system pass on a lie?” as a measurable property, using large-scale stress tests and external evidence checks before AI is built into clinical tools.
“Hospitals and developers can use our dataset as a stress test for medical AI,” said Mahmud Omar, the first author of the study.
“Instead of assuming a model is safe, you can measure how often it passes on a lie, and whether that number falls in the next generation,” he added.