LESSON 05 · 24 min read · Published May 2026

Temperature: the one knob that changes everything.

A model produces a probability distribution. Temperature is a single number that decides how peaky or flat that distribution is before we sample. Turn it down and the model is a scribe. Turn it up and it's a dreamer. Turn it up too far and it's incoherent.

Every time a language model picks a word, it does the same two-step. First, it computes a score (a logit) for every possible token in the vocabulary. Second, it turns those scores into probabilities and draws one. Temperature is a single number T that lives between step one and step two: every score is divided by T before the conversion. It does not change the model. It does not change the prompt. It only changes how aggressively the model commits to its top guess.

That sounds small. It is not. The same prompt at T=0.2 writes you a polite email; at T=1.5 it writes you a fever dream. The same engine, the same weights, the same input — and a knob deciding how much chance is allowed in the room.

Temperature is not creativity.
It is controlled creativity.
Section 01

What temperature does to a distribution

The clearest way to see temperature is to pick a real prompt, look at the model's top eight candidates, and watch what happens to the bars as you sweep the knob from 0 to 2: the leader's percentage rises as T falls and shrinks as T grows.

Figure 1.1 · Reshaping the next-token distribution for "She opened the old letter and ___" (T swept from 0 · frozen, through 1 · default, to 2 · chaos)

Notice: at low T, almost all the probability piles onto "began." At high T, the bars flatten — even unlikely words like "vanished" get a real chance. Entropy is the math word for that flatness.

Two limits are worth holding in your head. As T → 0, sampling becomes deterministic — the model picks the top word every time, also called greedy decoding. As T → ∞, every word in the vocabulary becomes equally likely; the model is reduced to a uniform random word generator.
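Both limits, and the entropy in between, are easy to check numerically. Here is a minimal Python sketch with hypothetical logits for eight candidates (the figure's real scores are not reproduced here). As T shrinks, the leader's share approaches 1 and entropy approaches 0 bits. As T grows very large, each candidate approaches 1/8 and entropy approaches log2 8 = 3 bits, the maximum for eight options.

```python
import math

def softmax_t(logits, t):
    # Divide by T, shift by the max so exp() cannot overflow, then normalize.
    scaled = [z / t for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    # 0 bits when one token holds all the mass; log2(N) bits when N tokens tie.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for eight candidate continuations, leader first.
logits = [4.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.0, -1.0]

# T -> 0 and T -> infinity are approximated here by 0.01 and 1e6.
for t in (0.01, 0.5, 1.0, 2.0, 1e6):
    probs = softmax_t(logits, t)
    print(f"T={t:g}: leader={probs[0]:.3f}, entropy={entropy_bits(probs):.3f} bits")
```

Exactly T = 0 cannot be plugged in, because the division would fail. That is why implementations special-case it as greedy decoding.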

Section 02

The math, in four steps

Temperature is one division. That's it. Here is the entire pipeline from the model's last layer to the chosen word.

Figure 2.1 · From logits to a sampled token: (1) logits z_i, (2) divide by T, (3) softmax, (4) sample.

The division is cheap, but its influence is large because softmax is exponential: small changes in the gaps between logits become large changes in the ratios between probabilities.
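The four steps fit in a few lines of Python. This is a minimal illustration with hypothetical logits for three candidate tokens and Python's `random.Random` as the sampler, not any particular engine's code. It prints the intermediate numbers at three temperatures.

```python
import math
import random

def four_steps(logits, temperature, rng):
    # Step 1: logits, one raw score per candidate token (passed in).
    # Step 2: divide every logit by the temperature.
    scaled = [z / temperature for z in logits]
    # Step 3: softmax. Subtracting the max first avoids overflow and does not
    # change the result, because softmax depends only on differences.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 4: draw one token index with those probabilities.
    choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return scaled, probs, choice

logits = [2.0, 1.0, 0.0]   # hypothetical scores for three candidate tokens
rng = random.Random(0)
for t in (0.5, 1.0, 2.0):
    scaled, probs, choice = four_steps(logits, t, rng)
    print(f"T={t}: scaled={scaled}, probs={[round(p, 3) for p in probs]}, sampled index={choice}")
```

With logits 2, 1, 0, the leader's share is about 0.867 at T = 0.5, 0.665 at T = 1 and 0.506 at T = 2. The sampled index varies with the seed.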

One way to feel why this works: softmax cares about differences between logits, not their absolute values. Dividing every logit by 0.5 (T = 0.5) doubles every gap, which after exponentiation explodes the leader's lead. Dividing by 2 (T = 2, the other direction) halves every gap and lets the also-rans catch up. Temperature is a gain knob on confidence.
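Written out with the same symbols as the formula p_i ∝ exp(z_i / T), the normalizing sum cancels when you divide two probabilities. The ratio therefore depends only on the gap between the two logits divided by T. For a gap of 1:

```latex
\frac{p_i}{p_j}
= \frac{\exp(z_i / T)}{\exp(z_j / T)}
= \exp\!\left(\frac{z_i - z_j}{T}\right)

z_1 - z_2 = 1:\qquad
\frac{p_1}{p_2} = e^{1} \approx 2.72 \ (T = 1),\quad
e^{2} \approx 7.39 \ (T = 0.5),\quad
e^{0.5} \approx 1.65 \ (T = 2)
```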

Section 03

Three writers, one prompt

The math is one thing. The feel of it on real text is another. Here is the same noir prompt sampled at three temperatures. Sampling each setting several times shows how much the output varies.

Figure 3.1 · "Write the opening line of a noir detective story."

Cold tends toward cliché — it picks the most-traveled phrase available. Warm finds the unexpected-but-fitting. Hot will sometimes give you a great line and sometimes give you a sentence that doesn't quite parse.

The trade-off is real. At low temperature you get reliability at the cost of variety: multiple runs return the same answer or near-copies. At high temperature you get variety at the cost of reliability: one run delights, the next derails. Many production systems settle around T = 0.7, which is close enough to the default to feel natural and low enough to keep the model on a leash.
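One way to put a number on "multiple runs return the same answer": the chance that two independent draws pick the same next token is the sum of the squared probabilities. Below is a minimal sketch with hypothetical logits. It measures a single step only. Over a multi-token reply the steps compound, so the chance of an exact repeat shrinks further.

```python
import math

def softmax_t(logits, t):
    # Divide by T, shift by the max for numerical safety, then normalize.
    scaled = [z / t for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def agreement(probs):
    # Probability that two independent draws return the same token: sum of p_i^2.
    return sum(p * p for p in probs)

logits = [3.0, 2.2, 1.8, 1.0, 0.5, 0.0]   # hypothetical next-token scores
for t in (0.2, 0.7, 1.5):
    print(f"T={t}: two runs pick the same token with probability {agreement(softmax_t(logits, t)):.2f}")
```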

Section 04

Temperature vs top-k vs top-p

Temperature is one of three sampling knobs you'll see in most LLM APIs, although not every provider exposes all three. They do different things and combine. Confusing them is a common decoding mistake.

Figure 4.1 · The three knobs, side by side

Temperature (reshapes). Scales every logit by 1/T before softmax. Affects the whole distribution and doesn't cut anyone off. Formula: p_i ∝ exp(z_i / T). A volume knob on confidence.

Top-K (truncates). After softmax, keep only the K highest-probability tokens, renormalize, and sample from those. K = 1 is greedy. Rule: keep the top K, drop the rest. A hard cap on the candidate pool.

Top-p, or nucleus (truncates). Rank tokens by probability and keep the smallest set, from the top down, whose probabilities add up to at least p (e.g. 0.9). Adapts to how peaky the distribution already is. Rule: keep tokens until Σ p_i ≥ p. A soft cap that scales with confidence.
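The two truncation rules can be sketched on a plain list of probabilities. This is a minimal Python illustration, not any particular library's implementation. The function names and the two example distributions are hypothetical. It shows what "adapts" means in practice: the same p keeps one token from a peaky distribution and four from a flat one, while top-K always keeps exactly K.

```python
def rank(probs):
    # Token indices sorted from most to least probable.
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def renormalize(probs, kept):
    # Zero out dropped tokens and rescale the survivors to sum to 1.
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    # Hard cap: keep exactly the k most probable tokens.
    return renormalize(probs, set(rank(probs)[:k]))

def top_p_filter(probs, p):
    # Soft cap: walk down the ranking until the kept mass reaches p.
    kept, mass = set(), 0.0
    for i in rank(probs):
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalize(probs, kept)

def survivors(filtered):
    # Count tokens that still have nonzero probability.
    return sum(1 for x in filtered if x > 0)

peaky = [0.92, 0.04, 0.02, 0.01, 0.01]   # hypothetical: the model is confident
flat = [0.30, 0.25, 0.20, 0.17, 0.08]    # hypothetical: the model is unsure

print(survivors(top_p_filter(peaky, 0.9)))   # → 1
print(survivors(top_p_filter(flat, 0.9)))    # → 4
print(survivors(top_k_filter(flat, 2)))      # → 2
```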

These compose. A common recipe: T = 0.7, top-p = 0.9. Temperature (below 1 here) sharpens the distribution, then top-p chops the long tail of nonsense before sampling.
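The recipe runs in three stages: temperature first, then the nucleus cut, then the draw. Here is a minimal sketch over a plain list of logits. The `sample` function and the example logits are hypothetical, and real APIs apply these steps inside their own samplers.

```python
import math
import random

def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    # 1. Temperature: rescale the logits, then softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-p: keep the most probable tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # 3. Sample among the survivors; choices() renormalizes the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [3.0, 2.0, 1.0, -2.0, -4.0]   # hypothetical next-token scores
print(sample(logits, rng=random.Random(0)))
```

With these hypothetical logits at T = 0.7, the top two tokens already hold about 95% of the mass. Top-p = 0.9 therefore cuts the other three, and only indices 0 and 1 can ever come back.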

Section 05

What temperature for what task

There is no universal default. The right setting depends on whether you want one correct answer or many plausible ones. Use this as a starting point, not a rule.

Figure 5.1 · Temperature recipes by use case

Task · T · Setting · Why
Code generation · 0.0–0.2 · Cold / greedy · There's usually one right answer. Variety isn't useful; correctness is.
Math & logic · 0.0–0.3 · Cold · Same reasoning: you want the same answer twice. One exception from research: self-consistency samples several answers at higher T, then takes a majority vote.
Q&A & retrieval · 0.2–0.5 · Cool · Mostly factual. A little softness lets the model phrase its answer naturally.
Chat / agents · 0.6–0.9 · Warm (default) · Sound human and vary phrasing across turns, but stay reliable when the user asks twice.
Brainstorming · 0.9–1.2 · Warm-hot · You want different answers each run. Combine with multiple samples to surface options.
Creative writing · 1.0–1.5 · Hot · The whole point is novelty. Pair with top-p ≈ 0.9 to keep grammar from melting.
Adversarial / fuzzing · 1.5+ · Wild · Finding weird outputs on purpose: red-teaming, sampling edge cases, dataset diversity.
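The self-consistency idea from the math row works like this: sample several answers at a warmer temperature, then keep the most common one. The sketch below is a toy. A real setup samples full reasoning chains from a model and extracts each final answer. Here, a hypothetical fixed distribution over three made-up answers stands in for the model.

```python
import collections
import math
import random

def softmax_t(logits, t):
    # Divide by T, shift by the max for numerical safety, then normalize.
    scaled = [z / t for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def self_consistency(answers, logits, temperature, n_samples, rng):
    # Stand-in for "run the model n times": draw n final answers at temperature T.
    probs = softmax_t(logits, temperature)
    votes = collections.Counter(rng.choices(answers, weights=probs, k=n_samples))
    # Majority vote: the most frequent answer wins.
    return votes.most_common(1)[0][0], votes

answers = ["42", "41", "24"]   # hypothetical candidate final answers
logits = [2.0, 1.2, 0.8]       # hypothetical scores; "42" is the favorite
best, votes = self_consistency(answers, logits, temperature=0.9, n_samples=25, rng=random.Random(7))
print(best, dict(votes))
```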

If you don't know, start at 0.7 and only move once you have a complaint.

Section 06

What temperature is not

Temperature is famous, which means it gets blamed for things it doesn't do.

It is not a creativity dial.

It cannot make a bad model good. If the model never learned a concept, no temperature will recover it — high T just produces a higher-entropy version of the same ignorance. Creativity comes from training; temperature only governs how willing the model is to deviate from its top guess.

It is not the same as randomness.

The randomness lives in the final sampling step; temperature only changes the shape of the distribution being sampled from. Greedy decoding at T=0 skips the draw, so it needs no seed at all. At any T > 0, fixing the seed makes the draw reproducible given identical logits, so T=2 with the same seed gives the same tokens every time. Same shape, same seed, same draw.
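Here is a minimal sketch of that claim, assuming Python's `random.Random` as the sampler's random number generator (real inference stacks use their own). With the same logits, the same T and the same seed, the draws repeat exactly, even at T = 2. The `draw` function and the logits are hypothetical.

```python
import math
import random

def draw(logits, temperature, seed, n=20):
    # Temperature-scaled weights; choices() normalizes them itself.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    rng = random.Random(seed)   # a fresh generator seeded identically each call
    return rng.choices(range(len(logits)), weights=weights, k=n)

logits = [3.0, 1.5, 1.0, -1.0]   # hypothetical scores
a = draw(logits, 2.0, seed=123)
b = draw(logits, 2.0, seed=123)
print(a == b)   # → True
```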

T = 0 is not literally zero.

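Most APIs treat T=0 as "use greedy decoding" rather than actually dividing by zero. Some implementations still show small floating-point nondeterminism at T=0. For example, GPU kernel scheduling or batching can change the order in which numbers are summed, which can flip the top choice when two tokens are nearly tied. As a result, even greedy output isn't always identical between calls.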
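Two hedged sketches illustrate the points above. The first treats T = 0 as argmax instead of dividing by zero, which is one plausible way to implement the convention, not any specific API's code. The second is a toy showing why near-ties are fragile. Floating-point addition is not associative, so summing the same three terms in a different order (as parallel reductions can) may nudge a logit enough to flip the argmax. The function names are hypothetical.

```python
import math
import random

def argmax(xs):
    # max() returns the first maximal index on ties.
    return max(range(len(xs)), key=lambda i: xs[i])

def pick_token(logits, temperature, rng=random):
    # T = 0 means greedy decoding: return the argmax instead of dividing by zero.
    if temperature == 0:
        return argmax(logits)
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]   # choices() normalizes these
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

print(pick_token([0.3, 2.1, 1.9], 0.0))   # → 1

# Toy near-tie: token 1's logit is the sum of three partial terms.
order_a = [0.6, (0.1 + 0.2) + 0.3]   # sums to 0.6000000000000001
order_b = [0.6, 0.1 + (0.2 + 0.3)]   # equals the float 0.6, a tie
print(argmax(order_a), argmax(order_b))   # → 1 0
```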

Higher temperature does not increase factuality.

If anything, the opposite. Hallucinations are more likely at high T because low-probability tokens — including incorrect entities, dates, and citations — get more chances. For factual tasks, lower is safer.
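Here is a toy illustration of the mechanism, with a hypothetical split: suppose the two highest-scoring tokens are correct and every lower-scoring token is a wrong entity. Each wrong-to-right probability ratio is exp((z_wrong − z_right) / T), so raising T can only move mass toward the wrong tokens at that step.

```python
import math

def softmax_t(logits, t):
    # Divide by T, shift by the max for numerical safety, then normalize.
    scaled = [z / t for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 4.0, 1.0, 0.5, 0.0, -0.5, -1.0]   # hypothetical scores
correct = {0, 1}                                  # hypothetical: top two are right

for t in (0.3, 0.7, 1.0, 1.5):
    probs = softmax_t(logits, t)
    wrong = sum(p for i, p in enumerate(probs) if i not in correct)
    print(f"T={t}: probability of a wrong token at this step = {wrong:.3f}")
```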

Temperature is set by the caller, not the model.

It's a runtime knob in the sampling loop, not a property of the weights. The same model can be a deterministic code-completer and a wild brainstormer in the same afternoon, just by changing one number on the API call.

One number. The whole personality.

Logits, divided by T, softmaxed, sampled. That's the whole story. The model is fixed; temperature is the door you leave open between what's most likely and what's merely possible.


Sources & further reading

  1. Ackley, Hinton & Sejnowski, A Learning Algorithm for Boltzmann Machines — where the temperature analogy comes from.
  2. Holtzman et al., The Curious Case of Neural Text Degeneration — the nucleus (top-p) sampling paper.
  3. Fan, Lewis & Dauphin, Hierarchical Neural Story Generation — top-K sampling for open-ended text.
  4. Wang et al., Self-Consistency Improves Chain of Thought Reasoning — sampling at higher T then voting.
  5. OpenAI, Anthropic & Google API references — the actual ranges and defaults you'll meet in production.