Tokens.
Before an AI model can think about your words, it has to break them into tokens: surprising little chunks of text that shape spelling, counting, pricing, context, and model behavior.
Type something into a chatbot. Anything. The first thing that happens, before the model thinks, before it answers, before any of the magic, is that your sentence gets shredded into pieces.
Not into letters. Not into words. Into tokens: little chunks of text that the model learned to recognise during training. Some tokens are whole common words like the or and. Some are word fragments like ing or tion. Some are single characters. Some emoji are several tokens by themselves.
Tokens are how the model reads. They are how it counts. They are what you are paying for, when you pay. Almost everything weird about AI, from spelling mistakes to strange counting errors, comes back to tokens.
Let me show you.
Type something. Watch it shatter.
This is a real tokenizer running in your browser. The widget uses the o200k_base encoding from the OpenAI-family BPE tokenizers, bundled into this static page. Type whatever you like.
Notice the dots? Those are spaces. The tokenizer treats a space as part of the next word, not as its own thing. So the with a leading space is a different token from the without one. In this encoding, even strawberry with a leading space and strawberry on its own can split differently.
That single design choice ripples through everything. It is why models sometimes seem to have trouble with the very first word of a reply, and why pasted whitespace can quietly change what a prompt costs.
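You can see the effect with a tiny sketch. This is not the real o200k_base vocabulary, just a toy greedy longest-match encoder over a handful of made-up pieces, where the leading space is part of the token the way OpenAI-style BPE encodings treat it:

```python
# Toy vocabulary (illustrative ids, not real o200k_base ids).
# Note " the" and "the" are two different entries.
VOCAB = {"the": 0, " the": 1, " cat": 2}

def encode(text, vocab):
    """Greedy longest-match: at each position, take the longest piece
    in the vocabulary that matches, and emit its id."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(encode("the cat", VOCAB))   # "the" at the start of the text
print(encode(" the cat", VOCAB))  # " the" after pasted whitespace
```

The same word gets a different id depending on whether a space precedes it, which is exactly why a stray leading space can change what a prompt costs.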
A sentence becomes a list of numbers.
The model cannot read text. Not really. It reads numbers. Every token in the world has been assigned a unique number, an id, and what the model receives is a list of those ids.
Here, watch the transformation. Step through it.
That last list is the entire input. To the model, your sentence is not English. It is [13225, 11, 2375, 0]: a handful of numbers fed into a very large machine.
Everything the model knows about language was learned from patterns in lists of numbers like this.
This is also why a model has a vocabulary size. A modern tokenizer works with a finite set of token pieces, and every poem, email, emoji, code snippet, and prompt is built from that set.
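One useful property: the mapping between pieces and ids runs both ways, so decoding is just looking the ids back up and concatenating. Here is a sketch using the four ids quoted above for "Hello, world!"; the dict is a tiny illustrative slice, not the full ~200,000-entry o200k_base vocabulary:

```python
# Four real o200k_base ids (the ones quoted in this lesson).
piece_to_id = {"Hello": 13225, ",": 11, " world": 2375, "!": 0}
id_to_piece = {v: k for k, v in piece_to_id.items()}

def decode(ids):
    # BPE is reversible: concatenating the pieces recovers the exact text.
    return "".join(id_to_piece[i] for i in ids)

print(decode([13225, 11, 2375, 0]))  # Hello, world!
```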
Same sentence. Different token shapes.
Every piece of text you give a model has a price, not in money here, but in tokens. Common English can be cheap. Punctuation can attach to a word. Numbers often split into chunks. Emoji can become byte-like pieces. Code and multilingual text have their own patterns.
Hover any pill below. The badge shows how many tokens that visible piece becomes.
The surprise is not only that rare words split. It is that ordinary-looking text can split for several different reasons: a comma, a repeated word with a leading space, a long number, a non-Latin script, an emoji modifier, or a tiny piece of code.
But the real surprise is emoji.
A single emoji looks like one character on your keyboard. Inside the tokenizer, it is something else entirely. Click any emoji to watch it come apart.
None of this means you should stop using emoji. But now you know what the model is actually seeing, and why it might respond to a thumbs-up differently than to the word yes.
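Under the hood, what makes emoji expensive is Unicode itself. Before any merging happens, a byte-level tokenizer sees text as UTF-8 bytes, and an emoji is several of them; add a skin-tone modifier and the count doubles:

```python
def utf8_bytes(text):
    """The raw bytes a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))

print(utf8_bytes("👍"))   # four bytes for one visible symbol
print(utf8_bytes("👍🏽"))  # eight: base emoji plus a skin-tone modifier
```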
Tokens are a compression trick, not a theory of language.
Byte Pair Encoding began as a compression idea: keep replacing frequent adjacent pieces with a new combined piece. Language models adapted that idea because it solves a practical problem. Whole-word vocabularies waste space on rare words, character vocabularies make sequences painfully long, and subword vocabularies sit in the middle.
The important thing is that tokens are learned from frequency. They are not little dictionary entries. They do not know that un is a prefix or that berry is a fruit. They are just pieces that were useful enough, often enough, to earn a slot.
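The whole learning procedure fits in a few lines. This is a minimal sketch of the BPE merge loop on a toy corpus, not a production trainer: start from characters, and repeatedly fuse the most frequent adjacent pair into a new combined piece.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn merge rules from frequency: each round, the most common
    adjacent pair of pieces becomes one new piece."""
    seqs = [list(w) for w in words]  # begin at the character level
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge everywhere it occurs.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

merges, seqs = bpe_merges(["low", "low", "lower", "lowest"], 2)
print(merges)  # ['lo', 'low']
print(seqs)    # "low" is now one piece; "er" and "est" stay as characters
```

Notice what it never consults: no dictionary, no grammar, no notion of prefixes. "low" earns a slot purely because it was frequent, which is the whole point.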
Tokenization is one of the places where language, compression, and economics touch the same wire.
Four things worth remembering.
1. Tokens explain context limits.
A context window is counted in tokens, not pages, words, or characters. Short English prose, dense code, emoji-heavy chat, and multilingual text can all consume the window at different rates.
2. Token IDs are not meanings.
The ID 13225 does not mean "hello" by itself. It is an index. The model turns that index into a vector, and meaning emerges from patterns learned across many contexts.
3. Tokenization changes over model generations.
OpenAI's tokenizer examples compare older encodings with cl100k_base and o200k_base; the same string can split differently under each encoding. That is why token counts are model-specific.
4. Byte-level tokenizers have a safety valve.
If the tokenizer can fall back to bytes, it can represent arbitrary text instead of giving up on unknown characters. That is useful for multilingual text, noisy text, code, and emoji, even when individual byte pieces look strange on screen.
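A hedged sketch of that safety valve: look each character up in a (toy) vocabulary first, and fall back to raw UTF-8 bytes for anything the vocabulary lacks, so no input is ever unrepresentable. Real byte-level BPE reserves proper ids for all 256 byte values; here byte tokens are shown as negative ids purely to keep them visually distinct.

```python
def encode_with_fallback(text, vocab):
    """Known characters get vocab ids; unknown characters degrade
    gracefully into their UTF-8 bytes instead of an <unk> token."""
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            ids.extend(-1 - b for b in ch.encode("utf-8"))  # illustrative byte ids
    return ids

toy_vocab = {"y": 0, "e": 1, "s": 2}
print(encode_with_fallback("yes👍", toy_vocab))
```

The emoji never crashes the encoder; it just costs four byte tokens instead of one.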
What this lesson is based on.
The widget is intentionally not a toy splitter anymore. It uses a browser-bundled JavaScript tokenizer for o200k_base, the modern OpenAI-family BPE encoding used by GPT-4o-class models. The explanations above are grounded in these primary references:
- OpenAI tiktoken: OpenAI's BPE tokenizer library, including examples for o200k_base and notes that BPE is reversible, handles arbitrary text, compresses text, and tends to expose common subwords.
- OpenAI Cookbook token counting guide: Shows how to count tokens and compares token IDs/bytes across r50k_base, p50k_base, cl100k_base, and o200k_base.
- Sennrich, Haddow, and Birch, ACL 2016: The classic neural machine translation paper that popularized subword units with BPE for rare-word handling.
- Wang et al., 2019: A byte-level subword paper showing why byte-level vocabularies are attractive for noisy and multilingual text.
Next lesson
Tokens are only the doorway.
Next: embeddings. Once text has become token IDs, the model turns each ID into a position in a learned space. That is where the next strange thing begins.
Get notified when it is done