LESSON 04 32 min read Published May 2026

Transformers:
the architecture
behind everything.

You've met tokens, embeddings, and attention. Now we put them together. The transformer is the assembly — a stack of identical blocks that takes a sentence in and predicts a next word out. Every modern LLM is a tall version of this same diagram.

Three lessons in, you have all the parts. Tokens turn text into integers. Embeddings turn integers into positions. Attention lets every position consult every other position. None of these alone is impressive. The trick is the assembly: stack the same block 12, 32, or 96 times, train it on most of the public internet, and out the other side comes something that can write code, summarize papers, and pay merchants on your behalf.

The transformer architecture, introduced in 2017, is so dominant that almost every model you've heard of — GPT, Claude, Llama, Gemini, Mistral — is a variant of the same pattern. The differences are mostly in the training data and the size. This lesson walks the diagram, top to bottom, until the whole thing makes sense as one machine.

The transformer is not a clever idea.
It is a deep idea, repeated.

Section 01

The whole pipeline, end to end

Before we open up any single block, here is the entire journey one word takes through the model. Each stage takes the previous one's output and transforms it.

Figure 1.1 · One word's journey through the model

1 · Token: "merchant"
2 · ID: id 47291
3 · Embed: […]
4 · N blocks: attn → mlp
5 · Predict: → next

The last word in your input is what the model is trying to extend. We follow it from string, to ID, to vector, through the stack, to a probability over every possible next token.

The first three stages you already know. The middle stage, the stack of identical blocks, is where everything interesting happens, and where nearly all of the model's learned parameters live (the embedding table holds most of the rest). Each block has the same shape; the model gets smart by applying that shape over and over, each time with different learned weights.
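The five stages can be sketched as plain function calls. This is a toy under loud assumptions: the vocabulary, the whitespace tokenizer, the random untrained embedding table, and the pass-through `block` are all hypothetical stand-ins, chosen only to make the data flow visible.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ["the", "agent", "paid", "merchant", "bank"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
D_MODEL = 4   # width of each token vector (toy value)
N_BLOCKS = 3  # number of stacked blocks (toy value)

# Lookup table for stage 3: one random, untrained vector per token ID.
EMBED = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in VOCAB]

def tokenize(text):
    """Stage 1: split text into tokens (whitespace split as a stand-in)."""
    return text.lower().split()

def to_ids(tokens):
    """Stage 2: map each token to its integer ID."""
    return [TOKEN_TO_ID[t] for t in tokens]

def embed(ids):
    """Stage 3: look up one vector per ID."""
    return [list(EMBED[i]) for i in ids]

def block(vectors):
    """Stage 4 placeholder: a real block applies attention, then an MLP.
    Here each vector passes through unchanged."""
    return [list(v) for v in vectors]

def predict(vectors):
    """Stage 5: score every vocabulary entry against the last token's vector."""
    last = vectors[-1]
    return [sum(a * b for a, b in zip(last, row)) for row in EMBED]

def forward(text):
    x = embed(to_ids(tokenize(text)))
    for _ in range(N_BLOCKS):
        x = block(x)
    return predict(x)

scores = forward("the agent paid the")
print(len(scores))  # one score per vocabulary entry
```

Only `block` changes when you move from this toy to a real model; the surrounding plumbing keeps the same outline.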

Section 02

Inside one block

Open up a single transformer block and you find six steps in a fixed order. Two are normalizations, two are residual additions, one is attention, one is a per-token feed-forward network (the MLP). The whole stack of GPT-3 is just this pattern, ninety-six times.

Figure 2.1 · A single transformer block, dissected

Two operations actually transform information: attention and the MLP. The other four exist to make training possible.

Two of these steps deserve their own beat. Attention is the only place tokens exchange information; every other operation in the block runs independently on each token. The MLP is where most of the model's parameters live. If attention is the model's routing, the MLP is the model's knowledge. Interpretability research suggests that facts like "Paris is in France" are stored largely in MLP weights rather than in attention.
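Here is one way the six steps might look in code. It is a sketch, not a reference implementation: it assumes the "pre-norm" ordering (normalize, sublayer, add), a single attention head, a ReLU MLP four times wider than the model, and a layer norm without learned scale or bias. The weights are random and untrained, and every name is illustrative.

```python
import math
import random

random.seed(0)
D = 4  # toy model width

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(v, m):
    """Row vector v times matrix m (len(v) rows, any number of columns)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Untrained toy weights: one attention head and a 4x-wide MLP.
W_Q, W_K, W_V, W_O = (rand_matrix(D, D) for _ in range(4))
W_UP, W_DOWN = rand_matrix(D, 4 * D), rand_matrix(4 * D, D)

def attention(xs):
    """Causal self-attention: position t reads positions 0..t only."""
    qs = [matvec(x, W_Q) for x in xs]
    ks = [matvec(x, W_K) for x in xs]
    vs = [matvec(x, W_V) for x in xs]
    out = []
    for t, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(D)
                  for s in range(t + 1)]
        weights = softmax(scores)
        mixed = [sum(w * vs[s][j] for s, w in enumerate(weights)) for j in range(D)]
        out.append(matvec(mixed, W_O))
    return out

def mlp(x):
    """Per-token feed-forward network: expand, ReLU, project back."""
    hidden = [max(0.0, h) for h in matvec(x, W_UP)]
    return matvec(hidden, W_DOWN)

def block(xs):
    normed = [layer_norm(x) for x in xs]              # 1. normalize
    attended = attention(normed)                      # 2. attention (tokens mix)
    xs = [add(x, a) for x, a in zip(xs, attended)]    # 3. residual add
    normed = [layer_norm(x) for x in xs]              # 4. normalize
    updates = [mlp(x) for x in normed]                # 5. MLP (per token)
    return [add(x, u) for x, u in zip(xs, updates)]   # 6. residual add

tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(3)]
out = block(tokens)
```

Note that `attention` is the only function that takes the whole list of token vectors; `layer_norm`, `mlp`, and `add` each see one token at a time.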

Section 03

The residual stream

Notice that every sublayer in a block ends the same way: add the result back to the input. This isn't a detail; it's the central design move. Each token's representation is a running vector that flows up through the entire stack, and every block writes a small edit to it. Researchers call this the residual stream: a shared bus that all blocks read from and write to.

Figure 3.1 · The residual stream as it accumulates

Legend: unchanged · edited · strongly edited

Each column is the same vector at a deeper layer. Cells that change layer-to-layer are where a block decided this dimension needed an edit. Most cells stay quiet most of the time — the model edits selectively.

This is why transformers can be ninety-six layers deep without falling apart. Without the residual stream, gradients would have to flow through every multiplication on the way down, and they would shrink to nothing. With it, every layer has a direct path to the input — a highway underneath the math.
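A scalar caricature makes the highway visible. Assume, purely for illustration, that each of 96 layers multiplies its input by a small weight `W = 0.01`. Stacked plainly, the sensitivity of the output to the input is the product of 96 small numbers. With a residual add, each layer computes `x + W * x`, so every factor is close to one instead. Real networks use matrices and nonlinearities, so treat this as intuition rather than a model of training.

```python
DEPTH = 96
W = 0.01  # hypothetical per-layer weight, chosen small on purpose

def plain_stack(x):
    """Each layer replaces x with W * x (no residual path)."""
    for _ in range(DEPTH):
        x = W * x
    return x

def residual_stack(x):
    """Each layer adds its small edit W * x back onto x."""
    for _ in range(DEPTH):
        x = x + W * x
    return x

def slope(f, x=1.0, h=1e-6):
    """Central finite difference: how much the output moves per unit of input."""
    return (f(x + h) - f(x - h)) / (2 * h)

plain_grad = slope(plain_stack)
residual_grad = slope(residual_stack)
print(plain_grad, residual_grad)
```

The plain stack's slope is vanishingly small, while the residual stack's slope stays of order one, which is the sense in which every layer keeps a direct path to the input.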

Section 04

From numbers to a next word

After the last block, the model has one vector per token. To predict the next word, it takes the very last token's vector and multiplies it by an unembedding matrix (in many models, simply the embedding matrix transposed) to get a score for every possible token in the vocabulary, about 200,000 of them for GPT-4o. Softmax turns those scores into probabilities, and the model samples from them. That last step is where the variety in a model's output comes from.
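The unembedding step is small enough to write out in full. The sketch below assumes the output weights are shared with the embedding table, which is a common choice but not a universal one. The four-word vocabulary, the embedding values, and `final_vector` are all made up for illustration.

```python
import math

VOCAB = ["merchant", "bank", "invoice", "moon"]  # hypothetical toy vocabulary

# Hypothetical embedding matrix: one row per token, width 3.
EMBED = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.5, 0.5, 0.2],
    [-0.8, 0.2, 0.9],
]

def unembed(hidden):
    """Score each token with a dot product against its embedding row.
    Equivalent to hidden @ EMBED.T, assuming shared input/output weights."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in EMBED]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

final_vector = [1.0, 0.2, -0.3]  # stand-in for the last token's output vector
logits = unembed(final_vector)
probs = softmax(logits)

for token, p in sorted(zip(VOCAB, probs), key=lambda pair: -pair[1]):
    print(f"{token:10s} {p:.3f}")
```

Tokens whose embedding rows point in roughly the same direction as the final vector receive the highest scores, which is why the same matrix can serve for both reading tokens in and scoring them on the way out.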

Figure 4.1 · Picking the next token

Prompt: The agent paid the…
Temperature: 1.00
Top-K: 8

Temperature reshapes the probabilities; top-K cuts off the long tail. Drop temperature to 0.1 and the model becomes nearly deterministic, almost always picking the top token. Raise it past 1.5 and even nonsense gets a chance.
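Both knobs fit in a few lines. This is a minimal sketch, assuming temperature divides the scores before softmax and top-K keeps only the K highest-scoring tokens; `sample_next` is a hypothetical helper, and treating a temperature of zero as "always take the top token" is a convention assumed here, not something every library does the same way.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=8, rng=random):
    """Return the index of the sampled token."""
    if temperature <= 0:
        # Convention assumed here: zero temperature means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Top-K: keep only the indices of the K highest scores.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k]

    # Temperature: divide scores before exponentiating. Low values sharpen
    # the distribution, high values flatten it.
    scaled = [logits[i] / temperature for i in kept]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]

    # random.choices normalizes the weights for us.
    return rng.choices(kept, weights=weights, k=1)[0]

# Usage: count how often each of four toy tokens is chosen.
toy_logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
for temp in (0.1, 1.0, 2.0):
    counts = [0] * len(toy_logits)
    for _ in range(1000):
        counts[sample_next(toy_logits, temperature=temp, top_k=3, rng=rng)] += 1
    print(temp, counts)
```

With `top_k=3` the last token is never chosen at any temperature, because it is cut before sampling happens.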

Then comes the loop that makes the whole illusion of fluency work: the chosen token is appended to the input, and the model runs again. Every token you see streaming out of an LLM is one full pass through the entire stack. A 200-token reply means 200 forward passes. Generation is just the model talking to itself, one token at a time.
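The loop itself is short. In the sketch below, `next_token_fn` stands in for "run the full stack, then sample", and `toy_model` is a hypothetical placeholder that simply counts upward so the control flow is easy to follow.

```python
def generate(next_token_fn, prompt_ids, max_new_tokens, stop_id=None):
    """Append one token at a time, feeding the growing sequence back in."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)  # one forward pass plus sampling
        ids.append(next_id)           # the output becomes part of the input
        if next_id == stop_id:
            break
    return ids

# Hypothetical stand-in for a model: predicts (last id + 1) mod 10.
calls = 0

def toy_model(ids):
    global calls
    calls += 1
    return (ids[-1] + 1) % 10

result = generate(toy_model, [3], max_new_tokens=5)
print(result)  # [3, 4, 5, 6, 7, 8]
print(calls)   # 5: one call per generated token
```

The optional `stop_id` mirrors how a real model ends a reply: it emits a special end token and the loop exits early.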

Section 05

Same shape, different scales

The architecture has barely changed since 2017. What has changed is the scale. The base 2017 transformer had about 65 million parameters. GPT-2 had 1.5 billion. GPT-3 had 175 billion. GPT-4 is rumored to be over a trillion. Each is the same diagram you just walked, with more layers and wider blocks.

Figure 5.1 · The shape of a few transformers you've heard of

| Model | Layers | Hidden dim | Parameters |
| --- | --- | --- | --- |
| Original Transformer · 2017 | 6 encoder + 6 decoder | 512 | 65 M |
| BERT-base · 2018 | 12 | 768 | 110 M |
| GPT-2 · 2019 | 48 | 1,600 | 1.5 B |
| GPT-3 · 2020 | 96 | 12,288 | 175 B |
| Llama 3.1 70B · 2024 | 80 | 8,192 | 70 B |
| GPT-4-class · 2024 | ~120 | ~18,000 | ~1.7 T |

Same blueprint. More floors. Wider rooms.
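Layers and hidden dimension nearly determine the parameter count on their own. A common back-of-the-envelope estimate is 12 · layers · d², where 4d² comes from the four attention projections and 8d² from an MLP four times wider than the model. The sketch assumes that standard layout and ignores embeddings, biases, and normalization weights. The layer counts used below (48 for GPT-2, 96 for GPT-3) are the figures from those models' papers. Models with other MLP or attention designs, such as Llama, will not match this formula as closely.

```python
def block_params(n_layers, d_model):
    """Rough parameter count for the stack of blocks alone.

    Assumes four d x d attention projections (4 * d^2) and an MLP with
    4x expansion (d x 4d up, 4d x d down, so 8 * d^2) in every layer.
    """
    attention = 4 * d_model * d_model
    mlp = 8 * d_model * d_model
    return n_layers * (attention + mlp)

gpt2 = block_params(n_layers=48, d_model=1600)
gpt3 = block_params(n_layers=96, d_model=12288)

print(f"GPT-2 estimate: {gpt2 / 1e9:.2f} B")  # 1.47 B, reported 1.5 B
print(f"GPT-3 estimate: {gpt3 / 1e9:.1f} B")  # 173.9 B, reported 175 B
```

Because width enters squared, doubling the hidden dimension quadruples the count, while doubling the number of layers only doubles it.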

Section 06

What the diagram doesn't show

The architecture is clean. Everything else about a real model is messy.

Training is most of the work.

The architecture is maybe 1% of the difficulty. The other 99% is curating trillions of tokens of data, running stable training across thousands of GPUs for months, and then post-training (RLHF, fine-tuning, safety tuning) to make the raw model useful and not catastrophic.

Inference is where most cost lives.

Every token you generate is a full pass through the stack. A single token from a dense 70B model takes roughly 70 billion multiplications, so a long reply runs into the tens of trillions. KV-caching, quantization, speculative decoding: most of modern serving is about making this loop cheaper.
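To see why KV-caching matters, count how many token positions the stack must process during one reply. Without a cache, each step re-runs the entire sequence so far. With a cache, the keys and values of earlier tokens are stored, so after the prompt each step only runs the newest token. The two helper functions are illustrative, and the count deliberately ignores that attention over the cached tokens still grows with sequence length.

```python
def positions_without_cache(prompt_len, new_tokens):
    """Each step re-processes the whole sequence generated so far."""
    return sum(prompt_len + i for i in range(new_tokens))

def positions_with_cache(prompt_len, new_tokens):
    """The prompt is processed once; each later step processes one new token."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)

prompt_len, new_tokens = 100, 200
print(positions_without_cache(prompt_len, new_tokens))  # 39900
print(positions_with_cache(prompt_len, new_tokens))     # 299
```

The gap widens as replies get longer, since the uncached total grows roughly with the square of the sequence length.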

The original is encoder-decoder. Modern LLMs are decoder-only.

The 2017 paper had two stacks (one to read, one to write). GPT-style models use only the decoder side with a causal mask, which turned out to be enough for almost everything. Encoder-decoder lives on in translation and image generation.
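A causal mask is simple to state in code: before softmax, any score where a token would look at a later position is set to negative infinity, so its weight becomes exactly zero. The function name and the all-zero example scores are illustrative assumptions.

```python
import math

NEG_INF = float("-inf")

def causal_attention_weights(scores):
    """scores[t][s] is how much position t wants to read position s.
    Returns softmax weights with every future position (s > t) zeroed out."""
    n = len(scores)
    weights = []
    for t in range(n):
        # Mask: keep s <= t, block s > t.
        row = [scores[t][s] if s <= t else NEG_INF for s in range(n)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With equal raw scores, each position spreads its attention evenly
# over itself and everything before it.
for row in causal_attention_weights([[0.0] * 3 for _ in range(3)]):
    print([round(w, 3) for w in row])
```

The first position can only attend to itself, which is what allows one pass over a full training sequence to yield a valid next-token prediction at every position.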

We can describe it precisely. We cannot fully explain it.

You can write the entire forward pass in 200 lines of PyTorch. Why a particular trained model recovers a specific fact, or chooses a specific tone, is still mostly opaque. The architecture is solved; the artifact it produces is not.

You now know the architecture behind every LLM.

Tokens. Embeddings. Attention. Stacked into blocks. Trained on the internet. Sampled one word at a time. The four lessons fit together into one diagram, and that diagram is the modern frontier.


Sources & further reading

Vaswani et al., Attention Is All You Need — the architecture, NeurIPS 2017.
Andrej Karpathy, Let's build GPT from scratch — the canonical implementation walkthrough.
Anthropic, A Mathematical Framework for Transformer Circuits — the residual-stream view of computation.
Radford et al., Language Models are Unsupervised Multitask Learners — GPT-2.
Touvron et al., LLaMA & Llama 3 technical reports — modern decoder-only specifics.
Hoffmann et al., Training Compute-Optimal Large Language Models — the Chinchilla scaling laws.