Temperature: the one knob that changes everything.
A model produces a probability distribution. Temperature is a single number that decides how peaky or flat that distribution is before we sample. Turn it down and the model is a scribe. Turn it up and it's a dreamer. Turn it up too far and it's incoherent.
Every time a language model picks a word, it does the same two-step. First, it computes a score for every possible token in the vocabulary. Second, it turns those scores into probabilities and draws one. Temperature is a single multiplicative knob that lives between step one and step two. It does not change the model. It does not change the prompt. It only changes how aggressively the model commits to its top guess.
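The two-step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's sampler: the token list and the logits are made up, and the function names `softmax_with_temperature` and `sample_token` are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide every logit by T, then convert the scaled scores to probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical scores for four candidate tokens (not from a real model).
tokens = ["began", "ended", "paused", "vanished"]
logits = [3.0, 2.0, 1.0, 0.0]

cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 1.5)
print("T=0.2:", [round(p, 3) for p in cold])
print("T=1.5:", [round(p, 3) for p in hot])

rng = random.Random(0)  # seeded so the draw is repeatable
picked = sample_token(tokens, logits, 0.7, rng)
print("sampled at T=0.7:", picked)
```

Note that the logits never change between the cold and hot runs. Only the divisor does.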
That sounds small. It is not. The same prompt at T=0.2 writes you a polite email; at T=1.5 it writes you a fever dream. The same engine, the same weights, the same input — and a knob deciding how much chance is allowed in the room.
It is controlled creativity.
What temperature does to a distribution
The clearest way to see temperature is to pick a real prompt, look at the model's top eight candidates, and watch what happens to the bars as you sweep the knob. The leader's percentage rises as T falls and shrinks as T climbs.
Notice: at low T, almost all the probability piles onto "began." At high T, the bars flatten — even unlikely words like "vanished" get a real chance. Entropy is the math word for that flatness.
Two limits are worth holding in your head. As T → 0, sampling becomes deterministic — the model picks the top word every time, also called greedy decoding. As T → ∞, every word in the vocabulary becomes equally likely; the model is reduced to a uniform random word generator.
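Both limits, and the entropy that measures the flatness in between, can be checked numerically. The sketch below assumes four made-up logits; `tempered_probs` and `entropy_bits` are illustrative helper names, and entropy is reported in bits, so a uniform distribution over four tokens tops out at 2.

```python
import math

def tempered_probs(logits, temperature):
    """Softmax of logits / T, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [3.0, 2.0, 1.0, 0.0]  # hypothetical scores
temperatures = [0.05, 0.5, 1.0, 2.0, 100.0]

results = []
for t in temperatures:
    probs = tempered_probs(logits, t)
    results.append((t, max(probs), entropy_bits(probs)))
    print(f"T={t:<6} top prob={max(probs):.3f}  entropy={entropy_bits(probs):.3f} bits")
```

Near T = 0.05 the top probability is effectively 1 and the entropy is effectively 0, which is greedy decoding in all but name. Near T = 100 the entropy approaches the 2-bit ceiling of a uniform draw.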
The math, in four steps
Temperature is one division. That's it. The entire pipeline from the model's last layer to the chosen word is four steps: compute the logits, divide them by T, apply softmax, and sample.
Temperature is a single, well-placed division. The reason it has so much influence is that softmax is exponential — small changes in the gaps between logits become big changes in the gaps between probabilities.
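Written out, the division looks like this. The notation is an assumption for illustration: z_i is the logit of token i, and p_i(T) is its probability after temperature scaling.

```latex
\[
p_i(T) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)},
\qquad
\frac{p_i(T)}{p_j(T)} = \exp\!\left(\frac{z_i - z_j}{T}\right)
\]
```

The second expression is where the exponential sensitivity shows up: the ratio between two tokens depends only on their logit gap divided by T, and that gap sits inside an exponent.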
One way to feel why this works: softmax cares about differences between logits, not their absolute values. Dividing every logit by T = 0.5 doubles every gap, which after exponentiation explodes the leader's lead. Dividing by T = 2 (the other direction) halves every gap and lets the also-rans catch up. Temperature is a gain knob on confidence.
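A quick numeric check of both claims, using made-up logits. The helper names are hypothetical; the log-odds between the leader and the runner-up should equal their logit gap divided by T.

```python
import math

def tempered_probs(logits, temperature):
    """Softmax of logits / T, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def leader_log_odds(probs):
    """Natural-log ratio between the first and second probabilities."""
    return math.log(probs[0] / probs[1])

logits = [2.0, 1.0, 0.0]               # hypothetical; leader's gap is 1.0
shifted = [z + 100.0 for z in logits]  # same gaps, different absolute values

base = tempered_probs(logits, 1.0)
moved = tempered_probs(shifted, 1.0)
print("shift changes nothing:", all(abs(a - b) < 1e-9 for a, b in zip(base, moved)))

odds = {}
for t in (0.5, 1.0, 2.0):
    odds[t] = leader_log_odds(tempered_probs(logits, t))
    print(f"T={t}: leader log-odds = {odds[t]:.3f}")
```

With a gap of 1.0, the log-odds come out as gap / T: doubled at T = 0.5 and halved at T = 2.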
Three writers, one prompt
The math is one thing. The feel of it on real text is another. Take the same noir prompt and sample it at three temperatures: cold, warm, and hot.
Cold tends toward cliché — it picks the most-traveled phrase available. Warm finds the unexpected-but-fitting. Hot will sometimes give you a great line and sometimes give you a sentence that doesn't quite parse.
The trade-off is real. At low temperature you get reliability at the cost of variety; multiple runs return the same answer or near-copies. At high temperature you get variety at the cost of reliability; one run delights, the next derails. Many production systems live around T = 0.7: close enough to the unscaled T = 1 to feel natural, low enough to keep the model on a leash.
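The trade-off can be seen in a toy experiment: draw the next word many times at a cold and a hot setting and count what comes back. Everything here is assumed for illustration, including the candidate words, their logits, and the helper `draw_many`.

```python
import math
import random
from collections import Counter

def tempered_probs(logits, temperature):
    """Softmax of logits / T, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates and scores, not taken from a real model.
tokens = ["rain", "smoke", "neon", "regret", "static"]
logits = [3.0, 2.0, 1.0, 0.0, -1.0]

def draw_many(temperature, n_draws, seed):
    """Sample the next word n_draws times and tally the results."""
    rng = random.Random(seed)
    probs = tempered_probs(logits, temperature)
    return Counter(rng.choices(tokens, weights=probs, k=n_draws))

cold_counts = draw_many(0.2, 1000, seed=0)
hot_counts = draw_many(1.5, 1000, seed=0)

print("T=0.2:", cold_counts.most_common())
print("T=1.5:", hot_counts.most_common())
```

The cold tally is dominated by the top word. The hot tally spreads across the whole candidate list, which is the variety and also the risk.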
Temperature vs top-k vs top-p
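Temperature is one of three sampling knobs you'll see in most APIs. Temperature rescales the whole distribution; top-k keeps only the k most likely tokens; top-p keeps the smallest set of tokens whose cumulative probability reaches p. They do different things and combine. Confusing them is a common decoding mistake.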
These compose. A common recipe: T = 0.7, top-p = 0.9. Temperature reshapes the distribution (at 0.7 it sharpens it slightly), then top-p chops the long tail of nonsense before sampling.
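A minimal sketch of that recipe, assuming temperature is applied first and top-p second. Real libraries may order or implement these steps differently, and the logits and the function name `top_p_filter` are hypothetical.

```python
import math

def tempered_probs(logits, temperature):
    """Softmax of logits / T, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_mass
    return filtered

logits = [3.0, 2.0, 1.0, 0.0, -1.0, -2.0]  # hypothetical scores

tempered = tempered_probs(logits, 0.7)     # step 1: temperature
final = top_p_filter(tempered, 0.9)        # step 2: nucleus cut

print("after T=0.7:    ", [round(p, 3) for p in tempered])
print("after top-p=0.9:", [round(p, 3) for p in final])
```

The tokens zeroed out by the cut can never be sampled, no matter how the draw goes. That is the difference from temperature alone, which shrinks the tail but never removes it.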
What temperature for what task
There is no universal default. The right setting depends on whether you want one correct answer or many plausible ones. Treat any recommended value as a starting point, not a rule.
If you don't know, start at 0.7 and only move once you have a complaint.
What temperature is not
Temperature is famous, which means it gets blamed for things it doesn't do.
It is not a creativity dial.
It cannot make a bad model good. If the model never learned a concept, no temperature will recover it — high T just produces a higher-entropy version of the same ignorance. Creativity comes from training; temperature only governs how willing the model is to deviate from its top guess.
It is not the same as randomness.
The randomness lives in the final sampling step, which runs whenever T > 0. Temperature only changes the shape of the distribution being sampled from. T=0 is deterministic in principle, with or without a seed, because nothing is drawn; T=2 with a fixed seed is also reproducible: same shape, same draw.
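A toy check of the reproducibility claim, using Python's seeded `random.Random` as a stand-in for a sampler's random number generator. The tokens, logits, and `sample_sequence` helper are assumptions, and a real inference stack has more sources of nondeterminism than this sketch.

```python
import math
import random

def tempered_probs(logits, temperature):
    """Softmax of logits / T, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates and scores.
tokens = ["began", "ended", "paused", "vanished"]
logits = [3.0, 2.0, 1.0, 0.0]

def sample_sequence(temperature, seed, n_draws=10):
    """Draw n_draws tokens from the same distribution with a seeded generator."""
    rng = random.Random(seed)
    probs = tempered_probs(logits, temperature)
    return rng.choices(tokens, weights=probs, k=n_draws)

first = sample_sequence(2.0, seed=42)
second = sample_sequence(2.0, seed=42)

print(first)
print(second)
print("identical:", first == second)
```

High temperature makes the distribution flat, not the draw unrepeatable. Fix the seed and the same flat distribution yields the same sequence.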
T = 0 is not literally zero.
Most APIs treat T=0 as "use greedy decoding" rather than actually dividing by zero. Some implementations still have small floating-point nondeterminism at T=0 from GPU kernel scheduling, so even greedy isn't always bit-identical between calls.
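One way a sampler might special-case T = 0 is sketched below. This is a hypothetical wrapper, not the code of any specific API: `choose_token` and its behavior on negative values are assumptions made for the example.

```python
import math
import random

def choose_token(tokens, logits, temperature, rng):
    """Pick a token, treating T = 0 as greedy decoding instead of dividing by zero."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy path: take the highest-scoring token, no random draw at all.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return tokens[best]
    # Sampling path: divide by T, softmax, then draw.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical candidates and scores.
tokens = ["began", "ended", "paused", "vanished"]
logits = [3.0, 2.0, 1.0, 0.0]

rng = random.Random(7)
greedy_picks = [choose_token(tokens, logits, 0, rng) for _ in range(5)]
print(greedy_picks)
```

In this toy the greedy path is exactly repeatable because the logits are fixed numbers. The floating-point caveat applies upstream, where a real model computes those logits on hardware.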
Higher temperature does not increase factuality.
If anything, the opposite. Hallucinations are more likely at high T because low-probability tokens — including incorrect entities, dates, and citations — get more chances. For factual tasks, lower is safer.
Temperature is set by the caller, not the model.
It's a runtime knob in the sampling loop, not a property of the weights. The same model can be a deterministic code-completer and a wild brainstormer in the same afternoon, just by changing one number on the API call.
One number. The whole personality.
Logits, divided by T, softmaxed, sampled. That's the whole story. The model is fixed; temperature is the door you leave open between what's most likely and what's merely possible.
Sources & further reading
- Ackley, Hinton & Sejnowski, A Learning Algorithm for Boltzmann Machines — where the temperature analogy comes from.
- Holtzman et al., The Curious Case of Neural Text Degeneration — the nucleus (top-p) sampling paper.
- Fan, Lewis & Dauphin, Hierarchical Neural Story Generation — top-K sampling for open-ended text.
- Wang et al., Self-Consistency Improves Chain of Thought Reasoning — sampling at higher T then voting.
- OpenAI, Anthropic & Google API references — the actual ranges and defaults you'll meet in production.