What AI hallucination really is — and why every physician needs to understand it
Language models like ChatGPT, Claude, and Gemini produce plausible text, not verified truth. That subtle distinction has heavy clinical consequences — and almost no one is telling physicians about it.
Published on: Apr 17, 2026
Reading time: 11 min read
Author: Equipe Humaniza Health
The resident, the perfect reference, and the DOI that didn't exist
Picture this. A medical resident is writing the discussion section of their thesis. They need a reference on heart failure management in patients with advanced chronic kidney disease. They open ChatGPT and ask: "Give me 3 recent references on HF management in advanced CKD."
The answer arrives in seconds. Three complete references — real authors, real journals, plausible years, impeccably formatted DOIs. The resident copies, pastes into the bibliography, and moves on.
A week later, the advisor clicks the DOI of the second reference. The page returns 404. They try PubMed. Nothing. They search the first author's name combined with the journal. The author exists. The journal exists. The article? Never published.
The AI invented a reference that looked real in every detail — and delivered it with the same confidence it would use for a real one.
This isn't a rare bug. This is the model's default behavior when it doesn't "know" the answer. And if this happens with a bibliographic reference — where the error is traceable — imagine what happens with a drug dosage, a pharmacological interaction, or a diagnostic criterion, where the error can be silent.
What "hallucination" means, technically
Hallucination, in the context of language models, is the generation of factually incorrect, fabricated, or unverifiable content, presented with the same confidence as correct content (Ji et al., 2023).
Notice what this definition says and what it doesn't. It doesn't say the model made an accidental mistake. It says the model produced something fabricated with confidence. That's the critical clinical difference: in a human scenario, when a doctor doesn't know something, they hesitate, qualify, refer. When an LLM doesn't know, it produces plausible text as if it did.
Hallucination is not a defect — it's an emergent property of how LLMs are trained. The model was optimized to produce probable text, not true text. The difference between those two things is subtle in most contexts and dangerous in all clinical contexts.
Why the term "hallucination" is misleading (and useful at the same time)
The word "hallucination" suggests anomaly, something that should be rare. But in LLMs, it's expected system behavior. Some researchers prefer the term confabulation — and the clinical analogy is revealing.
In Korsakoff syndrome, the patient fills memory gaps with coherent, detailed narratives, with no awareness that they're fabricating. They're not lying — they're producing the best story their brain can assemble from available information. There's no intent to deceive, no awareness of error. The narrative simply seems real.
That's exactly what an LLM does. It doesn't "know" it's fabricating, because it has no internal model of "truth" separate from its model of "language." It produces the most probable sequence of words given the context — and when the probable sequence doesn't match reality, hallucination is born.
Why this happens
To understand hallucination, it helps to understand what an LLM actually does — without math, but without oversimplifying to the point of losing the mechanism.
A language model was trained on immense amounts of text (books, papers, websites, forums) to learn statistical patterns of language. When you ask a question, it doesn't "search" for the answer in a database. It predicts the most probable next word, then the next, then the next — until it produces text that, statistically, resembles the kind of response that would exist in its training data.
When there's enough training data on that topic, the prediction tends to be factually correct. When there isn't — or when the data is ambiguous, contradictory, or sparse — the model does what it always does: produces probable text. The difference is that, this time, the probable text isn't true.
And the model doesn't flag this difference. Nothing in the generated text signals how certain the model is. The apparent confidence is the same.
| What the physician assumes | What the LLM actually does |
|---|---|
| Searches for information in a reliable source | Predicts the most probable next word |
| Says "I don't know" when uncertain | Produces plausible text whether or not it has the knowledge |
| Cites references that exist | May generate well-formatted nonexistent references |
| Distinguishes fact from fabrication | Has no internal model for that distinction |
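The prediction loop described above can be sketched in a few lines of Python. This is a toy illustration, not how any production model is implemented: the `NEXT_WORD_PROBS` table and every probability in it are invented for the example, whereas a real LLM computes such probabilities with a neural network over a very large vocabulary. What the sketch shows is structural: the loop always emits the most probable continuation, and nothing in it checks whether the result is true.

```python
# Toy next-word predictor. All probabilities below are invented for
# illustration; a real LLM learns them from training data.
NEXT_WORD_PROBS = {
    "the": {"usual": 0.6, "patient": 0.4},
    "usual": {"dose": 0.9, "course": 0.1},
    "dose": {"is": 1.0},
    "is": {"500": 0.55, "250": 0.45},  # nearly a coin flip, yet one value wins
    "500": {"mg": 1.0},
    "250": {"mg": 1.0},
}


def generate(start: str, max_words: int = 6) -> str:
    """Greedy decoding: repeatedly append the most probable next word."""
    words = [start]
    while len(words) < max_words and words[-1] in NEXT_WORD_PROBS:
        candidates = NEXT_WORD_PROBS[words[-1]]
        # Pick the highest-probability word. There is no truth check and
        # no "I don't know" branch anywhere in this loop.
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)


print(generate("the"))  # prints "the usual dose is 500 mg"
```

The output reads as a confident statement even though, in this invented table, "500" beat "250" by a narrow margin. That margin never reaches the reader.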
Types of hallucination a physician will encounter
In clinical and academic practice, the most common hallucinations aren't absurd answers — they're almost-right answers, which makes them far more dangerous. A recent multinational survey of clinicians found that over 90% had identified medical hallucinations in AI tools, and 85% considered there was a risk of patient harm (Kim et al., 2025).
The most frequent types:
- Fabricated bibliographic reference. Real author, real journal, plausible year, formatted DOI, nonexistent article. The easiest type to detect (just search the DOI), yet the most dangerous in academic contexts, as it can contaminate a citation chain.
- Wrong dosage with high confidence. The model may output "amoxicillin 500mg q8h for 7 days" in the same assured tone whether or not that regimen fits the indication at hand. If the dose is wrong, nothing in the format signals the error.
- Fictitious guideline. The model mixes recommendations from two different guidelines (e.g., blending AHA and ESC criteria) and presents them as if from a single, coherent source.
- Inverted drug interaction. "There is no significant interaction between X and Y" when the interaction is in fact clinically relevant. Or the reverse: alerting to a nonexistent interaction.
- Mixed diagnostic criteria. Applying criteria from one entity to another, with perfect formatting (numbered list, sources, everything "right").
An LLM can give you a medication dose with two decimal places of precision — and be fabricating. Apparent precision is not a sign of truthfulness. This may be the single most important piece of information for a healthcare professional to take from this trail.
Why this is different from human error
The critical difference isn't that AI makes more or fewer errors than a physician — it's that it makes errors in a way that's more convincing.
When a colleague gives you wrong information, there are usually signals: hesitation, qualification ("I think that's right, but check"), or at least the implicit recognition that they could be wrong. An LLM produces error and truth with the same tone, the same formatting, the same confidence. No inflection, no caveat, no "wait, let me confirm."
This activates a phenomenon called automation bias: the human tendency to trust an automated tool more than one's own judgment or a colleague's. A 2024 study measured how often participants agreed with incorrect AI recommendations and found that non-specialists are significantly more susceptible to the bias, though even specialists are not immune (Kücking et al., 2024).
For healthcare professionals, this is doubly dangerous. First, because AI can seem "safer" than a colleague (it shows no uncertainty). Second, because time pressure in clinical settings reduces verification capacity — which is exactly when automation bias hits hardest.
How to mitigate: what works and what doesn't
What doesn't work on its own
Some strategies that seem reasonable but don't solve the problem:
- Asking the AI to "be sure." This changes the phrasing of the response, not its truthfulness. The model can add "I'm certain" or "I can confirm" without it meaning anything.
- Using the "most expensive" or "latest" model. Larger models hallucinate less on average, but don't eliminate the problem. The fundamental architecture is the same.
- Setting temperature to zero. Reduces variability but doesn't eliminate hallucinations. The model can still deterministically converge on the wrong answer.
- Trusting because "it's ChatGPT Plus" or "it's Claude Pro." The subscription tier doesn't alter the generation mechanism. A paid model hallucinates in the same way as a free one.
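The temperature point above is easier to see with numbers. The sketch below applies the standard softmax-with-temperature formula to two candidate answers. The scores in `logits` are hypothetical, chosen so that the model slightly prefers the wrong answer; the function name `softmax_with_temperature` is also just an illustrative choice. Lowering the temperature sharpens the distribution around whichever answer already scores highest, right or wrong.

```python
import math


def softmax_with_temperature(logits: dict, temperature: float) -> dict:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        # In practice "temperature 0" means: always take the top-scoring token.
        raise ValueError("temperature must be positive")
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical scores: suppose the model slightly prefers the WRONG dose.
logits = {"250 mg (correct)": 2.0, "500 mg (wrong)": 2.4}

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

As the temperature drops, the probability of the wrong answer rises toward 1. The output becomes more repeatable, not more correct.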
What works
- Ask for a verifiable citation and check the citation. This is the simplest and most neglected test. If the AI cites a paper, search the DOI. If it cites a guideline, open the guideline. This takes minutes and filters out the majority of serious hallucinations.
- Use tools with RAG (Retrieval-Augmented Generation). RAG is an architecture that forces the model to consult real documents before responding, grounding generation in verifiable sources. Tools like Google's NotebookLM use this approach. IRIS, Humaniza Health's clinical decision support platform, was built on RAG precisely for this reason: every response is anchored in traceable clinical evidence. We'll explore RAG in depth in its dedicated post in this trail.
- Ask the same question in different ways. If the answer changes substantially when you rephrase, that's a sign of factual instability: the information probably isn't well-anchored in the training data.
- Know when NOT to use AI. Some tasks are categorically unsuitable for LLMs without RAG: rare drug dosing, emergency differential diagnosis, narrow therapeutic index drug interactions. When the cost of error is high and verification is difficult, the safe path is the primary source.
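The RAG idea mentioned above, consulting real documents before answering, can be sketched as follows. This is a deliberately simplified illustration and does not describe how NotebookLM, IRIS, or any other product is built. The `SOURCES` snippets, the `STOPWORDS` list, and the function names are all hypothetical, and plain word overlap stands in for the semantic search a real system would more likely use. The part worth noticing is the refusal branch: when no source matches, no prompt is produced at all.

```python
import re
from typing import List, Optional

# Hypothetical mini "library" of vetted snippets. A real system would index
# guideline or literature passages instead of hard-coded strings.
SOURCES = {
    "guideline-A": "Example passage about managing heart failure in chronic kidney disease.",
    "guideline-B": "Example passage about antibiotic choice in community-acquired pneumonia.",
}
STOPWORDS = {"a", "about", "how", "in", "is", "of", "the", "to", "what"}


def tokenize(text: str) -> set:
    """Lowercase the text, split it into words, and drop common filler words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def retrieve(question: str, min_overlap: int = 2) -> List[str]:
    """Return IDs of sources sharing at least `min_overlap` words with the question."""
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(text)), sid) for sid, text in SOURCES.items()]
    return [sid for overlap, sid in sorted(scored, reverse=True) if overlap >= min_overlap]


def build_prompt(question: str) -> Optional[str]:
    """Build a grounded prompt, or return None when no source supports the question."""
    hits = retrieve(question)
    if not hits:
        return None  # nothing to ground on: decline rather than guess
    context = "\n".join(f"[{sid}] {SOURCES[sid]}" for sid in hits)
    return f"Answer using ONLY the sources below and cite their IDs.\n{context}\n\nQuestion: {question}"


print(build_prompt("How should heart failure be managed in chronic kidney disease?"))
print(build_prompt("What is the dose of drug X in pregnancy?"))  # prints None
```

Grounding narrows what the model can draw on and makes the answer traceable to a source ID. It does not guarantee the model reads the source correctly, so the cited passage still deserves a look.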
A good heuristic: if you can't verify the answer in under 2 minutes, don't trust it. Use AI as a starting point, not a verdict.
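For DOIs specifically, the check can be automated. The sketch below separates two questions that are easy to confuse: whether a string looks like a DOI, and whether the DOI is actually registered. The function names are hypothetical, the regular expression covers only the common modern DOI shape, and the lookup assumes the public Crossref REST API endpoint `/works/{doi}` answers 404 for DOIs it does not know, so treat it as a starting point rather than a definitive validator.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

# Common modern DOI shape: "10.<registrant>/<suffix>". This is a syntactic
# check only: a fabricated DOI can match it perfectly.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(doi: str) -> bool:
    """Return True if the string has the usual DOI shape (says nothing about existence)."""
    return bool(DOI_PATTERN.match(doi.strip()))


def doi_is_registered(doi: str, timeout: float = 10.0) -> bool:
    """Ask the Crossref REST API whether it holds a record for this DOI.

    Requires network access. A 404 means Crossref has no such record; DOIs
    registered with other agencies may also come back as 404 here.
    """
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi.strip())
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits or outages are not evidence either way


print(looks_like_doi("10.1145/3571730"))             # True: a DOI from this post's references
print(looks_like_doi("10.9999/fabricated.2024.001"))  # True as well: format proves nothing
```

Both strings pass the format check, which is exactly the trap from the resident's story. Only the registry lookup, or simply opening the DOI link, tells a real reference from an invented one, and even a registered DOI still has to match the authors, title, and claim attributed to it.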
Limits of what we know
It would be dishonest to close without caveats.
Hallucination is not a "bug to be fixed in the next version." It's a property of the current language model paradigm. Models are getting better (they hallucinate less with more training, larger contexts, and techniques like RLHF), but the problem has not been eliminated, and nothing suggests it will be as long as the core architecture remains the same.
Benchmarks that measure hallucination are still imperfect. Researchers have recently argued that the most widely used medical benchmarks (like MedQA, based on the USMLE) have significant construct validity problems — meaning they may not be measuring what they claim to measure (Alaa et al., 2025). When you read "GPT-X scored 90% on the USMLE," that number needs to be read with caution.
Beyond hallucination, there's another risk this post can mention but not cover: using patient data on AI platforms raises serious privacy and data protection concerns. That topic will have a dedicated post in Wave 1 of this trail.
What to take from this read
Hallucination is the first concept in this trail because it's the concept that changes how you use any AI tool. Not to stop using them — but to use them knowing what's underneath.
Three ideas to carry with you:
First: the LLM doesn't search for truth — it predicts probable text. When the two coincide, it seems like magic. When they don't, it seems reliable and is wrong.
Second: formatting precision (correct DOI, dose with decimals, structured list) is not an indicator of truthfulness. It's an indicator of good language statistics. Those are different things.
Third: the best defense isn't avoiding AI — it's verifying what it produces with the same rigor you'd apply to any other clinical source. UpToDate doesn't replace clinical judgment. AI doesn't either.
AI doesn't lie. AI also doesn't know the truth. Understanding that difference is the first clinical skill for anyone who wants to use AI in healthcare.
Next step in the trail
Next week: "What is the technology behind artificial intelligence?", which lays the conceptual ground for what we discussed here. After that, LLMs, then RAG — where the hallucination problem starts to have a practical answer.
If you haven't read it yet, check the trail introduction post, which explains the full journey. And when you want the full V0 trail view, use the editorial filter at /en/blog?category=guia-ia-saude.
References
- Ji Z, Lee N, Frieske R, et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys. 2023;55(12):248. doi:10.1145/3571730
- Kim Y, et al. Medical Hallucinations in Foundation Models and Their Impact on Healthcare. arXiv. 2025. arXiv:2503.05777
- Kücking F, Hübner U, et al. Automation Bias in AI-Decision Support: Results from an Empirical Study. Stud Health Technol Inform. 2024;317:298-304. doi:10.3233/SHTI240871
- Alaa A, Hartvigsen T, et al. Medical Large Language Model Benchmarks Should Prioritize Construct Validity. arXiv. 2025. arXiv:2503.10694. Accepted at ICML 2025.