LESSON 03 · 28 min read · Published May 2026

Attention: how models decide what to look at.

Embeddings give every word a position. Attention is what the model does next: for each word, it asks which other words matter right now.

Read this sentence: "The agent paid the merchant after checking its budget." When you arrive at the word "its", you don't pause and consider every word equally. Your eye flicks back to "agent". That single move, looking selectively at the parts of the past that matter, is the entire idea behind attention.

Before the transformer, neural networks processed text one word at a time and tried to cram everything they'd seen so far into a single fixed-size memory. It worked for short sentences and broke for long ones. The 2017 paper "Attention Is All You Need" replaced that memory with something simpler: at every step, let every token look directly at every other token, and let it decide for itself which ones to care about.

Embeddings give meaning a place.
Attention gives meaning a relationship.

Section 01

What "looking" looks like

Take the classic puzzle: "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to? You know without thinking, but for a model this is a real choice. Attention is how it makes that choice. The figure below shows what each word pays attention to.

Figure 1.1 · Attention weights for one head

For 'it', most of the attention flows back to 'trophy', not 'suitcase'. The model has resolved the pronoun.

The colored fill on each token is its attention weight: how much of the focal word's "looking budget" went there. Every focal token has a budget of exactly 1.0, so the weights always sum to one. That's the whole interface: a probability distribution over the tokens the focal word can see.
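To make the "looking budget" concrete, here is a minimal sketch of how raw scores become weights that sum to one. The token list and the score values are made up for illustration; they are not taken from any real model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the focal token "it" (illustrative numbers only).
tokens = ["The", "trophy", "didn't", "fit", "in", "the", "suitcase"]
raw_scores = [0.1, 3.0, 0.2, 0.5, 0.1, 0.1, 1.5]

weights = softmax(raw_scores)
for token, w in zip(tokens, weights):
    print(f"{token:>10s}  {w:.3f}")
print("budget spent:", round(sum(weights), 6))
```

Whatever scores go in, the weights that come out are positive and add up to the full budget. The largest score ("trophy" in this toy setup) takes the largest share.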

Section 02

Many heads, many habits

A single attention pattern can only capture one kind of relationship at a time. So transformers run several patterns in parallel, typically 12 to 96 heads per layer, and each is free to specialize. Some heads track pronouns. Some lock onto the previous word. Some always look back at the start of the sentence. Others learn things we don't have names for.

Figure 2.1 · Four heads on the same sentence

No one programmed these specialties. They emerged from training. When researchers later peer inside a trained model, they often find heads that look uncannily linguistic — one head consistently links subjects to verbs, another links determiners to their nouns. The model has rediscovered grammar by accident.
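The parallel heads described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the weights are random instead of trained, the sizes (5 tokens, 16 dimensions, 4 heads) are arbitrary, and the names `multi_head_attention`, `w_q`, `w_k`, `w_v`, and `w_o` are assumptions chosen for this example. It relies on the query, key, and value mechanics covered in Section 03.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model). Every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(m):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head computes its own (seq_len, seq_len) attention pattern.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v  # (n_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection.
    merged = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5): one attention pattern per head
```

Because each head works on its own slice of the projected vectors, the four patterns in `weights` are free to differ. With trained weights, that freedom is where the specializations come from.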

Section 03

The math: query, key, value

Here's the mechanism in one paragraph. Every token's embedding is multiplied by three different learned matrices to produce three new vectors: a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("what do I contribute if picked?"). To compute the attention from token A to every other token, dot A's query with every key. Big dot product means strong match. Softmax the scores into a clean distribution. Then weighted-sum the values. That weighted sum becomes A's new representation.

Figure 3.1 · Q, K, V vectors for your sentence

attention(Q, K, V) = softmax(QKᵀ / √d) V

Here d is the length of each query and key vector. Once Q, K, and V exist, the rest is two matmuls and a softmax.

These vectors are made up — real Q/K/V live in 64–128 dimensions per head. The arithmetic, though, is exactly this.
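The formula above translates almost line for line into NumPy. The sketch below assumes random embeddings and random projection matrices (`w_q`, `w_k`, `w_v` are illustrative names, and the sizes are arbitrary), so the numbers mean nothing. Only the sequence of operations matters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # every query dotted with every key
    weights = softmax(scores, axis=-1)  # each row becomes a distribution
    return weights @ v, weights         # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_model, d = 4, 8, 4

# Stand-ins for learned quantities: random embeddings and projections.
embeddings = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d))
w_k = rng.normal(size=(d_model, d))
w_v = rng.normal(size=(d_model, d))

q, k, v = embeddings @ w_q, embeddings @ w_k, embeddings @ w_v
new_repr, weights = attention(q, k, v)

print(weights.round(2))  # row i: how token i spreads its attention
print(new_repr.shape)    # (4, 4): one new vector per token
```

Row i of `weights` is token i's distribution over the sequence, and row i of `new_repr` is the corresponding blend of value vectors.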

Section 04

Looking only at the past

When a model is generating text, predicting one token at a time, it can't peek at the future, because the future hasn't been written yet. During training, though, the full sequence is sitting right there, so the attention pattern is masked to recreate that condition: each token can attend to itself and earlier tokens, but never to later ones. This is the causal mask, and it's what makes GPT-style models work.

Figure 4.1 · The causal mask on a 5-token sentence

Green cells: token can look there. Red cells: blocked. Diagonal: looking at itself. The lower triangle is the only attention a generative model is allowed.
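One common way to build this mask, sketched here with random scores standing in for real query-key products, is to set every "future" score to negative infinity before the softmax so that its weight comes out as exactly zero. The function name `causal_attention_weights` is made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(scores):
    """Block attention to later positions, then normalize each row."""
    seq_len = scores.shape[-1]
    # True strictly above the diagonal marks the "future" cells.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    return softmax(masked, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))  # stand-in for Q @ K.T / sqrt(d)

weights = causal_attention_weights(scores)
print(weights.round(2))
```

The printed matrix is lower-triangular: the first token can only attend to itself, so its row is a single 1.0, and each later row spreads its budget over itself and the tokens before it.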

Encoder models like BERT skip the mask — they read the whole sentence at once, both directions, which is why they're great for classification and bad for generation. Decoder models like GPT keep the mask and trade bidirectional context for the ability to write.

Section 05

Sharp or blurry: the temperature knob

Softmax has a hidden dial. Before computing it, divide all scores by a number called temperature. Low temperature exaggerates differences: the top score wins by a landslide. High temperature flattens them: every option becomes plausible. This is the same knob you've seen as temperature=0.7 in API calls, where it reshapes the distribution over next tokens. Inside attention, the √d divisor in the Section 03 formula plays the same role, fixed rather than adjustable.

Figure 5.1 · Softmax distribution over six candidates (temperature slider from 0.05 to 3.0, shown at T = 1.00)

Medium temperature → a clear winner with realistic uncertainty.

The same logic governs attention sharpness. When a head is confident, its softmax output is spiky and almost all weight goes to one token. When it's uncertain, the weight smears across many tokens. Watching how this distribution shifts across layers is one of the main tools researchers use to understand what a model is doing.
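The effect of the dial is easy to see numerically. The sketch below uses six made-up scores and the two slider endpoints from Figure 5.1 plus the midpoint. `softmax_with_temperature` is an illustrative helper, not a library function.

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    """Divide the scores by the temperature, then apply softmax."""
    scaled = np.asarray(scores, dtype=float) / temperature
    scaled -= scaled.max()  # numerical stability
    e = np.exp(scaled)
    return e / e.sum()

# Six hypothetical candidates (illustrative numbers only).
scores = [2.0, 1.0, 0.5, 0.2, 0.1, -1.0]

for t in (0.05, 1.0, 3.0):
    probs = softmax_with_temperature(scores, t)
    print(f"T={t:<4}  top={probs.max():.3f}  {probs.round(3)}")
```

At T = 0.05 nearly all of the weight lands on the top candidate, at T = 1.0 there is a clear but not overwhelming winner, and at T = 3.0 the distribution is much flatter.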

Section 06

What attention isn't

Attention is the most-discussed mechanism in modern AI, which means it's also the most over-attributed. A few corrections worth carrying with you:

Attention weights are not explanations.

It's tempting to say "the model attended to trophy, therefore it understood the pronoun." Jain & Wallace showed that, in the models they tested, attention weights can often be permuted or swapped for very different ones with little change to the output. The weights are part of the computation, not a transcript of its reasoning.

Attention is quadratic.

Every token attends to every other token, so the cost grows as the square of context length. Doubling your context window means quadrupling the attention compute. This is why long-context models are expensive and why much of recent research is about cheaper approximations (sliding window, sparse, linear attention).
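The arithmetic behind "doubling means quadrupling" fits in a few lines. This sketch counts only query-key pairs for one full, unmasked attention pattern. It ignores heads, layers, and every other cost, and `attention_score_count` is a hypothetical helper written for this example.

```python
def attention_score_count(context_length):
    """Query-key pairs in one full (unmasked) attention pattern."""
    return context_length * context_length

baseline = attention_score_count(1_000)
for n in (1_000, 2_000, 4_000, 8_000):
    pairs = attention_score_count(n)
    print(f"{n:>6,} tokens -> {pairs:>12,} scores ({pairs // baseline}x)")
```

Each doubling of the context multiplies the number of scores by four, so going from 1,000 to 8,000 tokens costs 64 times as many scores, not 8 times.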

One head ≠ one concept.

The pretty examples — "this head does pronouns" — are real but rare. Most heads in a real model are doing diffuse, polysemantic work that doesn't map onto a single human-readable pattern. Be skeptical of any clean story.

The architecture is just plumbing.

Attention is what makes transformers capable, not what makes them smart. The intelligence — such as it is — comes from training on enormous text corpora. Attention is the mechanism that lets that training stick.

Three lessons in, you have the spine of a transformer.

Tokens chunk the input. Embeddings give each chunk a position. Attention lets every position consult the others. Stack that block dozens of times (96 in GPT-3) and you have the skeleton of GPT.

Lesson 04 · Next
Transformers: putting all the pieces together

Sources & further reading

  1. Vaswani et al., Attention Is All You Need — the original transformer paper, NeurIPS 2017.
  2. Jay Alammar, The Illustrated Transformer — the visual primer everyone learns from.
  3. Anthropic, A Mathematical Framework for Transformer Circuits — modern interpretability via attention.
  4. Clark et al., What Does BERT Look At? — empirical study of attention head specialization.
  5. Jain & Wallace, Attention Is Not Explanation — the case against reading too much into the weights.