LESSON 06 · 26 min read · Published May 2026

Hallucinations: when the model is confidently wrong.

A language model's job is to predict plausible next tokens — not to tell the truth. Most of the time those two goals overlap. When they don't, the model produces a fluent, grammatical, well-cited sentence that simply isn't true. The fluency is the trap.

Prompt: Tell me about Marie Curie's Nobel Prizes.

annotated reply

Marie Curie won the Nobel Prize in Physics in 1903, shared with her husband Pierre and Henri Becquerel. She later won a second Nobel Prize in 1909, this time in Physics again, alongside Pierre, who attended the ceremony in Stockholm. Some sources suggest she was the first person to win Nobels in two distinct sciences.

Legend: verified by sources · fabricated · hedged but unsourced

The paragraph above reads beautifully. It also gets the year wrong, gets the field wrong, and places a husband at a ceremony three years after his death. None of those errors broke the prose. That is the whole problem with hallucinations: they are smooth on the surface and broken underneath, and the model is built to produce the smooth surface no matter what.

You'll often hear hallucinations described as "the model lying" or "making things up." Both framings are misleading. The model has no concept of truth, no internal flag for I-don't-know. It has a probability distribution over next tokens, learned from a frozen snapshot of text. Sometimes the highest-probability tokens line up with reality. Sometimes they don't. What we call a hallucination is the second case, dressed in the same fluency as the first.

Hallucinations are not a bug in language models. They are a feature of how language models work.

Section 01

Five flavors, one mechanism

"Hallucination" is a single word for several different failure modes. They share a cause, but they look different in practice and need different defenses.

Figure 1.1 · A taxonomy of hallucinations

The first three are about content — the model says false things. The last two are about contract — the model violates an explicit instruction. Different problems, often confused.

Section 02

How the model gets there

To see a hallucination form in slow motion, watch the model answer one token at a time. At every step it picks the most fluent continuation; at no step does it consult a database, ask a friend, or check a fact.

Figure 2.1 · A hallucination, token by token

"When did Marie Curie win her second Nobel?"

The fluent sentence is the result of a chain of locally optimal picks. Notice how a single bad sample at the year forces a plausibly consistent, and equally wrong, choice of field two steps later.

This is the structural reason hallucinations don't go away with bigger models. A bigger model has better priors and gets the right answer more often. But it answers the same way: by picking probable tokens. When a fact is rare in training data, or contested, or absent, the model still produces a sentence — because that's what it does. Refusing to answer is a learned behavior layered on top, not a default.
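The chain of locally optimal picks can be sketched with a toy stand-in for a model. Everything here is an assumption made for illustration: `next_token_probs` is a hypothetical function, the probabilities are invented, and a real model scores a vocabulary of many thousands of tokens rather than two.

```python
def next_token_probs(prefix):
    """Hypothetical stand-in for a model: returns P(next token | prefix).

    All probabilities are invented for illustration.
    """
    last = prefix[-1]
    if last == "in":
        return {"1911": 0.6, "1909": 0.4}
    if last in ("1911", "1909"):
        return {"for": 1.0}
    if last == "for":
        # The field is conditioned on whichever year is already in the prefix.
        if "1909" in prefix:
            return {"Physics": 0.7, "Chemistry": 0.3}
        return {"Chemistry": 0.9, "Physics": 0.1}
    return {"<end>": 1.0}


def greedy_continue(prefix):
    """Extend the prefix by always taking the most probable next token."""
    tokens = list(prefix)
    while True:
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)
        if token == "<end>":
            return tokens
        tokens.append(token)


# Same procedure, two prefixes: one with the likely year, one with a bad sample.
good = greedy_continue(["She", "won", "in"])
bad = greedy_continue(["She", "won", "in", "1909"])
print(" ".join(good))
print(" ".join(bad))
```

Nothing in `greedy_continue` checks a fact. Once "1909" is in the prefix, the most probable field changes with it, and the sentence stays fluent either way.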

Section 03

Confidence is not correctness

A natural intuition: if the model sounds sure, it's probably right. This is the most dangerous misconception in the field. The model's expressed confidence — and even its internal probability — is only loosely tied to whether the answer is true.

Figure 3.1 · Twelve answers, plotted
Horizontal axis: stated confidence (low to high). Vertical axis: actually correct (true or false). Quadrants: right & confident, wrong & confident, right & hedged, wrong & hedged.

The confident-and-wrong quadrant is where most real damage gets done. A user can defend against a hesitant wrong answer; a confident one slides past.

This gap is called miscalibration. A well-calibrated model that says "I'm 80% sure" should be right about 80% of the time across all the answers it gives at that confidence. The confidence LLMs express in free-form answers is often higher than their accuracy warrants, especially on the kinds of niche, specific questions where they're most likely to be wrong.
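One way to make the 80% rule concrete is to bucket answers by stated confidence and compare each bucket's average confidence with its actual accuracy. The sketch below assumes you already have `(stated_confidence, was_correct)` pairs from a graded evaluation set; the function names, the bin count, and the sample records are illustrative, not taken from any particular library.

```python
from collections import defaultdict


def calibration_table(records, n_bins=5):
    """Group (confidence, correct) pairs into equal-width confidence bins.

    Returns a list of (mean_confidence, accuracy, count), one per non-empty bin.
    """
    bins = defaultdict(list)
    for confidence, correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    table = []
    for index in sorted(bins):
        rows = bins[index]
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table.append((mean_conf, accuracy, len(rows)))
    return table


def expected_calibration_error(records, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy."""
    table = calibration_table(records, n_bins)
    total = sum(n for _, _, n in table)
    return sum(n / total * abs(conf - acc) for conf, acc, n in table)


# Made-up example: five answers stated at 90% confidence, only two correct.
overconfident = [(0.9, True), (0.9, False), (0.9, False), (0.9, True), (0.9, False)]
gap = expected_calibration_error(overconfident)
print(round(gap, 2))
```

A gap near zero means stated confidence tracks accuracy. A large gap in the high-confidence bins is the confident-and-wrong pattern described above.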

Section 04

What actually helps

You cannot remove hallucinations from a language model. You can make them rarer, easier to catch, or move responsibility for the answer somewhere else. Four techniques carry most of the weight.

Figure 4.1 · The four working mitigations

Retrieval (RAG) · Most effective

Before answering, fetch relevant documents from a trusted source and inject them into the prompt. The model now answers from text in its context, not from training memory. Hallucinations don't disappear, but they shift from invention to misreading.
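A minimal sketch of the fetch-and-inject step, assuming a tiny in-memory document list and a naive word-overlap score. Real systems typically use embedding search and a vetted corpus; `retrieve`, `build_prompt`, and the sample documents are hypothetical names and data chosen for illustration.

```python
def retrieve(question, documents, k=2):
    """Rank documents by how many question words they share (toy scoring)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=2):
    """Inject the top-k passages into the prompt ahead of the question."""
    passages = retrieve(question, documents, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


docs = [
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
    "The Eiffel Tower is in Paris.",
    "Pierre Curie died in 1906.",
]
question = "When did Marie Curie win the Nobel Prize in Chemistry?"
prompt = build_prompt(question, docs)
print(prompt)
```

The instruction to answer only from the passages is what moves the failure from invention to misreading: the model can still misquote passage [1], but the date is now in front of it.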

Lower temperature · Cheapest

For factual tasks, push T toward 0. The model will pick its most-probable token, which tends to be the most-supported-by-data token. Doesn't help on questions where the model itself is wrong, but cuts off the long tail of low-probability fabrications.
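To see what pushing T toward 0 does, here is a small sketch of temperature-scaled softmax over three candidate tokens. The logit values are invented, and exactly T = 0 is usually handled as a plain argmax rather than a division, so the function rejects it.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.2)
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
```

At the lower temperature nearly all the probability mass sits on the top token, which is why the long tail of unlikely fabrications gets cut off. It does nothing when the top token itself is wrong.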

Refusal training · Built-in

During post-training (RLHF, fine-tuning), reward the model for saying "I don't know" on questions it gets wrong. Modern frontier models do this aggressively. The risk is over-refusal: the model declining things it actually knows.

Verifier / second model · Belt & braces

Run the answer through a second model trained to check it — either against retrieved documents, or by re-generating and looking for consistency. Used in production for legal, medical, and code applications where wrong answers are expensive.

Most production stacks combine two or three. RAG + low temperature is the modern default; verifier models get added when the cost of being wrong is high enough to justify the latency.

Section 05

How to catch them

Three patterns do most of the heavy lifting in detection. None of them requires access to the model's internals — they all work from the outside, by re-asking, checking, or comparing.

Self-consistency: ask N times, look for agreement

Sample the same prompt at moderate temperature several times. If the model returns the same answer in 9 of 10 runs, it's probably learned, although a model can also be consistently wrong, so agreement is evidence rather than proof. If every run gives a different specific number, it's probably guessing. This is cheap, model-agnostic, and surprisingly effective on factual questions.

Run 1: 1911
Run 2: 1911
Run 3: 1909
Run 4: 1911
Run 5: 1911
Vote: 1911 (4/5)
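The vote above can be sketched in a few lines. The answers are passed in as plain strings, as if they had already been sampled; the `min_agreement` threshold of 0.7 is an arbitrary assumption you would tune for your own task.

```python
from collections import Counter


def majority_vote(answers, min_agreement=0.7):
    """Return (winning_answer, agreement_ratio, trusted) for sampled answers."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return winner, agreement, agreement >= min_agreement


runs = ["1911", "1911", "1909", "1911", "1911"]
answer, agreement, trusted = majority_vote(runs)
print(answer, agreement, trusted)

# Four runs, four different years: the signature of guessing.
scattered = majority_vote(["1909", "1911", "1903", "1910"])
print(scattered)
```

Exact string matching only works for short answers such as years or names. Longer answers need some normalization or a semantic comparison before they can be counted.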

Citation grounding: every claim must point somewhere

Force the model to cite a source for every factual claim, then verify each citation programmatically — does the URL exist, does the quoted passage actually appear in the document, does the date match? Citations are the part of the answer easiest to falsify, and falsifying them catches a huge fraction of hallucinations.

claim A: cite ✓ exists
claim B: cite ✓ matches
claim C: cite ✗ 404
claim D: cite ✗ wrong text
Reject: 2 of 4 claims
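Two of the checks described above, "does the source exist" and "does the quoted passage appear in it", can be sketched without any network access by treating the sources as an in-memory dictionary. The claim format, the source IDs, and the sample text are hypothetical; a production checker would fetch the URL and handle paraphrase, not just exact quotes.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def check_citation(claim, sources):
    """claim: dict with 'source_id' and 'quote'. sources: dict of id -> text."""
    document = sources.get(claim["source_id"])
    if document is None:
        return "source not found"
    if normalize(claim["quote"]) not in normalize(document):
        return "quote not in source"
    return "ok"


def audit(claims, sources):
    """Check every claim and count how many fail."""
    results = [check_citation(c, sources) for c in claims]
    rejected = sum(1 for r in results if r != "ok")
    return results, rejected


sources = {"curie-bio": "Marie Curie received the Nobel Prize in Chemistry in 1911."}
claims = [
    {"source_id": "curie-bio", "quote": "Nobel Prize in Chemistry in 1911"},
    {"source_id": "curie-bio", "quote": "Nobel Prize in Physics in 1909"},
    {"source_id": "missing-page", "quote": "anything"},
]
results, rejected = audit(claims, sources)
print(results, rejected)
```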

Cross-check: pit two answers against each other

Get the same question answered by two systems — two models, two retrieval sources, the model with and without context — and surface the disagreements. The agreements are probably true; the disagreements are exactly where to look harder.

Model A: 1911, Chemistry
Model B: 1911, Chemistry
Wikipedia: 1911, Chemistry
Confidence: high (3-way)
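Surfacing disagreements is mostly bookkeeping once each answer has been reduced to comparable fields. This sketch assumes every system's answer has already been parsed into a small dictionary; the system names and the disagreeing value are invented for illustration.

```python
def cross_check(answers):
    """answers: dict of system name -> dict of field -> value.

    Returns, per field, whether all systems agree and what each one said.
    """
    fields = set()
    for answer in answers.values():
        fields.update(answer)
    report = {}
    for field in sorted(fields):
        values = {name: answer.get(field) for name, answer in answers.items()}
        report[field] = {
            "agreed": len(set(values.values())) == 1,
            "values": values,
        }
    return report


answers = {
    "model_a": {"year": "1911", "field": "Chemistry"},
    "model_b": {"year": "1909", "field": "Chemistry"},
    "reference": {"year": "1911", "field": "Chemistry"},
}
report = cross_check(answers)
to_review = [name for name, entry in report.items() if not entry["agreed"]]
print(to_review)
```

Only the fields in `to_review` need a closer look, which keeps human attention on the places where the systems actually diverge.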

Section 06

What hallucinations are not

The word "hallucination" gets stretched to cover anything the model gets wrong. Most of these are different problems with different fixes.

Not lying.

Lying requires knowing the truth and saying otherwise. The model has no internal representation of truth — only of what is statistically likely to be said. A confident wrong answer is not malice; it is the same machinery that produces confident right answers, applied to a question it can't actually answer.

Not stale knowledge.

If the model says Joe Biden is the U.S. president after his term has ended, that's a knowledge cutoff problem, not a hallucination. The model is correct as of its training data. The fix is updating context (RAG, system prompt), not retraining.

Not user disagreement.

If the model gives a tonally awkward email or a structurally different essay than you wanted, that's a preference miss, not a hallucination. Hallucinations are about truth, not taste.

Not solvable by scaling alone.

Bigger models hallucinate less, but the rate never reaches zero, and the errors that remain are delivered with the same confidence. Frontier models in 2025 still produce fabricated citations, made-up court cases, and confident wrong dates. Scale lowers the rate; it does not change the mechanism.

Not always bad.

The same machinery that fabricates a citation also writes a poem, names a startup, or invents a metaphor. Generation that goes beyond the training data is the point of the technology. The trick is steering it: invention when you ask for invention, fidelity when you ask for fact.

Fluency is not truth.

Language models are professional sentence-finishers. Most of the time the most fluent sentence is also the true one. When it isn't, you get a hallucination — and you only catch it by checking, retrieving, or asking again.

Lesson 07 · Coming next

Retrieval-Augmented Generation: giving the model a library


Sources & further reading

Ji et al., Survey of Hallucination in Natural Language Generation — the canonical taxonomy paper.
Lin, Hilton & Evans, TruthfulQA — the benchmark that quantified the problem.
Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — the original RAG paper.
Wang et al., Self-Consistency Improves Chain of Thought Reasoning — the basis for the agreement-vote pattern.
Kadavath et al., Language Models (Mostly) Know What They Know — calibration in modern LLMs.
Mata v. Avianca, S.D.N.Y. 2023 — the lawyer who filed fabricated case citations. Read it.
