LESSON 02 · 26 min read · Published May 2026

Embeddings: where words become geography.

Tokens turn text into numbers the model can index. Embeddings turn those numbers into positions — and once meaning has a location, the rest of modern AI falls into place.

In the last lesson we watched a sentence get sliced into tokens — small reusable pieces, each with an integer ID. The model receives a list like [15339, 11, 1917, 0] and… does what, exactly? It can't add the numbers. It can't compare them. 15339 isn't bigger than 11 in any meaningful way; the IDs are just labels.

So before any thinking happens, every token ID is replaced by a long list of decimal numbers. That list is called an embedding. It's the model's way of giving each token a position in a high-dimensional space. Words with similar meanings end up near each other. Words with different meanings end up far apart.

This lesson is about that space — what it looks like, why it works, and the three things every AI app you build will eventually do with it: compare, compute, and retrieve.

A token is a name.
An embedding is an address.

Section 01

From IDs to meaning

The bridge from the previous lesson to this one is short but important. The tokenizer has already split your text and assigned each piece a number. That number is just a row index into a giant lookup table called the embedding matrix. For a model like GPT-4o, that table has roughly 200,000 rows (one per vocabulary item). The exact row width is not public, so this lesson uses around 3,000 numbers per row as a round working figure.

Type a word below to watch the trip: text → token → ID → vector.

Figure 1.1 · The token-to-vector bridge (interactive: a token, its ID, e.g. 15339, and 24 of its ~3,000 embedding dims)

Each token ID indexes one row of the embedding matrix. We're showing 24 of the ~3,000 numbers; the real vector is much longer.

The numbers themselves are meaningless to look at. No single dimension means "is an animal" or "is plural." Meaning lives in the combination of all of them — in the precise position the vector occupies relative to every other vector.
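The lookup described above can be sketched in a few lines. This is a toy illustration, not any real model's code: the sizes, the random values, and the token IDs are all made up for the example, and a trained model would have learned its rows rather than drawn them at random.

```python
import random

VOCAB_SIZE = 8   # toy value; the lesson cites ~200,000 rows for a real model
EMBED_DIM = 4    # toy value; the lesson cites ~3,000 numbers per row

rng = random.Random(0)  # fixed seed so the toy table is reproducible

# The embedding matrix: one row of EMBED_DIM floats per vocabulary item.
# Real models learn these values during training; here they are random.
embedding_matrix = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Replace each token ID with the matching row of the embedding matrix."""
    return [embedding_matrix[token_id] for token_id in token_ids]

token_ids = [5, 1, 7, 0]   # hypothetical IDs from a toy tokenizer
vectors = embed(token_ids)

# One vector per token, each EMBED_DIM numbers long.
print(len(vectors), len(vectors[0]))
```

The whole "bridge" is an index operation: the ID carries no meaning of its own, it only selects a row.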

Section 02

The shape of meaning

We can't draw a 3,000-dimensional space, but we can flatten it down to two and still see something honest. Below is a small slice of an embedding space, projected onto a 2D plane. Hover any word to see its nearest neighbors.

Figure 2.1 · A 2D projection of an embedding space (interactive; axes: dim 1, dim 2)

Notice that nothing labeled the clusters. Animals huddle with animals; cities huddle with cities; happy words cluster apart from sad ones — purely from the patterns the model saw in training text.

This is the central magic. The model was never told cat and kitten are related. It read enough sentences to notice they appear in similar contexts ("my ___ is hungry," "the ___ is purring") and so their addresses ended up next to each other. The geometry is a side effect of statistics — and yet it lines up with how humans think about meaning.
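To see how "similar contexts" can turn into "nearby addresses," here is a deliberately crude sketch. It assumes a six-sentence corpus invented for this example and represents each word by simple counts of the words it shares a sentence with. Neural models learn dense vectors rather than raw counts, so treat this as a stand-in for the statistical idea, not as how any real model is trained.

```python
import math
from collections import Counter

# Tiny invented corpus: "cat" and "kitten" appear in the same kinds of sentences.
corpus = [
    "my cat is hungry",
    "my kitten is hungry",
    "the cat is purring",
    "the kitten is purring",
    "the city is crowded",
    "the city has traffic",
]

def context_vector(word, sentences):
    """Count the words that appear in the same sentence as `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

cat, kitten, city = (context_vector(w, corpus) for w in ("cat", "kitten", "city"))

print("cat vs kitten:", round(cosine(cat, kitten), 3))
print("cat vs city:  ", round(cosine(cat, city), 3))
```

Nothing in the code says cat and kitten are related. They score higher with each other than with city only because they keep the same company.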

Section 03

Closeness is the whole point

Once meaning is a location, "are these two ideas similar?" becomes "how close are these two points?" The standard ruler is cosine similarity: the cosine of the angle between two vectors. It's 1 when they point the same way, 0 when they're perpendicular, and -1 when they point in opposite directions. In practice, scores for text embeddings usually fall between about 0 and 1.
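Cosine similarity is small enough to write out by hand: the dot product of the two vectors divided by the product of their lengths. The sketch below uses plain Python lists and made-up 2D vectors to hit the three landmark values; production code would more likely use a vectorized library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up vectors chosen to show the three landmark cases.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])             # same direction
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])    # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # reversed

print(round(same, 6), round(perpendicular, 6), round(opposite, 6))
```

Note that [1, 2] and [2, 4] score as identical: cosine similarity ignores vector length and compares direction only.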

Figure 3.1 · Phrase-to-phrase similarity (interactive; shows a similarity score for two phrases)

Try: "I love coffee" vs "espresso is great", then "I love coffee" vs "the dog is hungry". Same sentence shape, very different scores.

Notice what is not happening: there's no keyword overlap requirement. "Puppy" and "dog" share zero letters with each other, but their vectors point in nearly the same direction, so cosine similarity rates them as almost the same idea. This is the property keyword search has spent thirty years pretending it had — and embeddings give it for free.

Section 04

Math on meaning

If meaning is a position, then directions in the space carry meaning too. Move from man to woman, and you've traveled along a "gender" direction. Move from Paris to France, and you've traveled along a "capital → country" direction. Roughly the same direction works for other pairs on that axis, so we can do arithmetic.

Figure 4.1 · Vector arithmetic on words (interactive, editable)
king − man + woman = queen

Tap a preset, or click any blue/red/green word to edit it. The model never learned "queen is the female king" — it just put the four words at the corners of a tiny parallelogram.

Word2vec made this trick famous in 2013. Modern transformers use the same idea but at a much richer scale: every position in the embedding space encodes hundreds of overlapping axes — gender, plurality, formality, sentiment, era, tone, and many more we haven't named.
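The parallelogram trick can be reproduced with a handful of hand-picked vectors. Everything below is an assumption made for illustration: the five words, the two dimensions, and the values are invented so the geometry is easy to see. Real embeddings have far more dimensions, no neatly labeled axes, and only approximate parallelograms.

```python
import math

# Hand-picked 2D toy vectors (invented; real embeddings are learned).
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy_vectors)
print("king - man + woman is closest to:", result)
```

Excluding the three input words from the candidates matters in practice, since the result of the arithmetic often lands nearest to one of the words you started with.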

Section 05

Why your app needs this

Here is the practical payoff. Suppose you have a pile of documents — help articles, news clips, code snippets, customer messages — and a user asks a question. You want to find the relevant pieces. Old-school keyword search needs the user's words to literally appear in the document. Embedding search just needs the meaning to overlap.

Figure 5.1 · Semantic search over a small corpus (interactive)

Toggle between modes. The query says nothing about Stripe or virtual cards, yet embedding search surfaces them. Keyword search misses everything that doesn't share literal words.

This pattern — embed the documents, embed the query, return the closest matches — is the engine behind retrieval-augmented generation (RAG), semantic recommenders, anti-spam clustering, deduplication, and most "AI search" features shipping in production today.
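The embed, embed, compare loop looks roughly like this. To keep the sketch runnable without any API, the document vectors and the query vector are hard-coded and invented; in a real app they would come from an embedding model. The document titles and the query are also hypothetical.

```python
import math

# Hypothetical documents with invented 3-dim "embeddings".
# In a real system these vectors would come from an embedding model.
DOCS = {
    "Stripe issues virtual cards for online spending": [0.9, 0.1, 0.0],
    "Resetting your account password": [0.1, 0.9, 0.1],
    "Office holiday schedule": [0.0, 0.1, 0.9],
}

QUERY_TEXT = "how can I pay without a physical wallet"
QUERY_VECTOR = [0.8, 0.2, 0.1]   # invented embedding of the query

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def keyword_search(query_text, docs):
    """Return documents that share at least one literal word with the query."""
    query_words = set(query_text.lower().split())
    return [text for text in docs if query_words & set(text.lower().split())]

def semantic_search(query_vector, docs, k=1):
    """Return the k documents whose vectors are closest to the query vector."""
    scored = [(text, cosine(query_vector, vec)) for text, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

print("keyword: ", keyword_search(QUERY_TEXT, DOCS))
print("semantic:", semantic_search(QUERY_VECTOR, DOCS, k=1))
```

This version compares the query against every document, which is fine for a few thousand vectors. Larger collections usually swap the loop for an approximate nearest-neighbor index.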

Embeddings are the verb of modern AI infrastructure. Models speak; vectors find.

Section 06

What gets lost

A 3,000-dimensional vector is rich but not infinite. Compress it further — to 384 dims, to 64, to 2 — and the space starts to lie. Drag the slider. Watch how the five clean clusters smear together when there isn't enough room to keep them apart.

Figure 6.1 · The cost of fewer dimensions (interactive slider, 2D to 1536D)
At 1536D: information loss 0% · distinct clusters 5 / 5 · storage per vector 6 KB

The smaller the vector, the cheaper to store and the faster to search — but the more meaning collapses. Picking the right dimension count is the central trade-off in vector databases.
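A stripped-down way to watch clusters run out of room: take five made-up cluster centers in 4D and keep only the first few coordinates. Simply dropping coordinates is an assumption chosen for clarity here. Real dimensionality reduction is smarter about what it discards, but the underlying squeeze is the same.

```python
# Five invented cluster centers in 4 dimensions, all clearly distinct.
centers = [
    (0, 0, 0, 0),
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
]

def distinct_after_truncation(points, keep_dims):
    """Count how many points remain distinguishable using the first keep_dims coordinates."""
    return len({tuple(p[:keep_dims]) for p in points})

surviving = [distinct_after_truncation(centers, k) for k in (4, 3, 2, 1)]

for k, count in zip((4, 3, 2, 1), surviving):
    print(f"{k} dims -> {count} of {len(centers)} clusters still distinct")
# surviving == [5, 4, 3, 2]
```

Each dimension removed makes two formerly separate centers land on the same spot, which is the "smearing" in miniature.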

A few honest caveats before you ship

Embeddings inherit the training set's biases.

If the corpus paired nurse with she and doctor with he, those associations are baked into the geometry. You will measure them later, often as a surprise.

Different models live in different spaces.

An OpenAI vector and a Cohere vector cannot be compared directly: the axes mean different things. The same goes for two different models from one provider. Keep one model per index, and re-embed everything when you migrate.

"Similar" isn't always "right."

Cosine similarity rewards topical closeness, not factual correctness. Two paragraphs that contradict each other on the same topic will still score high. Add a reranker for anything safety-critical.

Cost is a function of dimensions × documents × queries.

At 4 bytes per float32 value, a million 1,536-dim vectors come to roughly 6 GB before any index overhead. Pick the smallest dimension that holds your task, and quantize when you can.
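The storage arithmetic is worth keeping as a helper. The sketch below assumes raw vectors only (no index structures, IDs, or metadata), 4 bytes per value for float32, and 1 byte per value for int8 quantization.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for the vectors alone, excluding any index overhead."""
    return num_vectors * dims * bytes_per_value

per_vector = index_size_bytes(1, 1536)                               # one float32 vector
float32_total = index_size_bytes(1_000_000, 1536)                    # a million float32 vectors
int8_total = index_size_bytes(1_000_000, 1536, bytes_per_value=1)    # same, quantized to int8

print(f"per vector: {per_vector / 1024:.1f} KiB")     # 6.0 KiB
print(f"float32:    {float32_total / 1e9:.2f} GB")    # 6.14 GB
print(f"int8:       {int8_total / 1e9:.2f} GB")       # 1.54 GB
```

Quantizing to int8 cuts storage to a quarter, usually at some cost in similarity precision, so measure retrieval quality before and after.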

That's the whole core idea.

Tokens are how the model reads. Embeddings are how the model thinks about what it just read. Everything else — attention, generation, agents — is built on top of this map.

Lesson 03 · Next
Attention: how models decide what to look at

Sources & further reading

  1. Mikolov et al., Efficient Estimation of Word Representations in Vector Space — the original word2vec paper, 2013.
  2. OpenAI, New embedding models and API updates — text-embedding-3-small / large, 2024.
  3. Anthropic, Contextual retrieval — practical RAG patterns and reranking, 2024.
  4. Cohere, Embed v3 documentation — multilingual embeddings and quantization tradeoffs.
  5. Pinecone, What is a vector database? — accessible primer on indexing high-dim vectors.