uhhh what is a token?

AI Doesn't Read. Here's What It Actually Does.

A plain-language explainer on tokens — the unit that makes AI work

Growing up, a token was something I bought at the arcade — a small metal disc, worth exactly one game. I'd trade my quarters at the counter, spend them one at a time at whatever machine I was stuck on, and when they were gone, I was done. Later it was the coin you needed to unlock a shopping cart at Superstore (a staff member slipped me one when my pockets were empty, so while I had it I no longer had to keep a loonie around), or a picture to represent my sorcerer in a virtual D&D campaign.

The word has turned up again, decades after my arcade days, in a completely different context. When you use AI tools like ChatGPT or Claude, you're spending tokens. Not metal discs, but the underlying logic is surprisingly similar: a unit of exchange that translates real money into machine access, consumed one at a time, in ways most users never see.

Understanding what an AI token actually is, and how tokens accumulate, can help you use these tools more effectively and think more critically about their limits.

What Is a Token?

A token is a chunk of text. Not a word, not a letter ... somewhere in between. The word cat is one token. The word unbelievable might be three or four. A space, a punctuation mark, or a common syllable like -ing can each be its own token. Numbers, code, and less common words tend to break into more tokens than everyday English. The number 1 is a single token. But π is another matter: its digits never settle into a repeating pattern, so writing it out to any precision (3.14159265358979…) forces the model to spend a fresh token every few digits, with no compact shorthand to fall back on. One symbol we can write in an instant becomes an endless ribbon of tokens. Somehow, irrational numbers are once again surfacing to complicate my life.

When you type a prompt, the AI doesn't see your words. It sees a stream of numbers, one for each token. Your message, "Can you summarize this report?", becomes something like [5211, 345, 17, 8765, 23, 190]. The model works entirely in this numerical space (Sennrich et al., 2016).
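To make the text-to-numbers step concrete, here is a minimal sketch in Python. It is a toy, not how any production tokenizer works: the vocabulary, its ID numbers, and the function names tokenize and encode are all invented for illustration, and real tokenizers learn their pieces from data rather than using a hand-written list.

```python
# Toy tokenizer: greedy longest-match against a tiny, made-up vocabulary.
# VOCAB and its IDs are hypothetical; real vocabularies are learned from data
# and contain tens of thousands of pieces.
VOCAB = {
    "un": 1, "believ": 2, "able": 3, "cat": 4, "the": 5,
    " ": 6, "?": 7, "ing": 8, "read": 9,
}


def tokenize(text: str) -> list[str]:
    """Split text into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for end in range(len(text), i, -1):
            if text[i:end] in VOCAB:
                pieces.append(text[i:end])
                i = end
                break
        else:
            # No vocabulary piece starts here: emit the single character.
            pieces.append(text[i])
            i += 1
    return pieces


def encode(text: str) -> list[int]:
    """Map each piece to its integer ID (0 stands in for 'unknown')."""
    return [VOCAB.get(piece, 0) for piece in tokenize(text)]


print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(encode("the cat"))         # [5, 6, 4]
```

The second line of output is the kind of list the model actually receives: numbers, not words.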

The One Trick Behind Everything

Here's the part that surprises most people: the model's entire job is to predict the next token.

That's it. Given the sequence of tokens so far, what token is most likely to come next? The model does this one calculation, very fast, very many times, and the result looks like reasoning, summarizing, translating, or drafting. It isn't that the model understands language the way you do. It has learned, from an enormous amount of text, which tokens tend to follow which other tokens in which contexts (Brown et al., 2020; Vaswani et al., 2017).

This is why AI can seem fluent and occasionally be wrong in strange ways at the same time. Fluency comes from pattern, not comprehension.
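A drastically simplified sketch of next-token prediction shows how fluency can come from pattern alone. The corpus below is made up, the "model" only counts which word follows which, and the function names are hypothetical. Real language models use neural networks that weigh the whole preceding context, so treat this as an analogy rather than a description of how they are built.

```python
from collections import Counter, defaultdict

# A tiny made-up "training corpus", split into word-level tokens.
CORPUS = "the cat sat on the mat . the cat ate the fish .".split()


def train(tokens):
    """Count, for each token, which tokens follow it and how often."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if never seen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]


def generate(follows, start, max_tokens=5):
    """Repeatedly append the predicted next token to the sequence."""
    out = [start]
    for _ in range(max_tokens):
        nxt = predict_next(follows, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return out


model = train(CORPUS)
print(predict_next(model, "the"))            # cat
print(generate(model, "on", max_tokens=2))   # ['on', 'the', 'cat']
```

The output reads like English, yet nothing in the code knows what a cat is. It only knows that "cat" followed "the" more often than anything else did.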

Why Tokens Show Up in Your Life

If you use AI tools professionally, tokens matter in two immediate ways.

Cost. Most AI APIs (the behind-the-scenes connections businesses use to integrate AI) are priced per token, and input and output are typically priced differently, with generated output costing more than what you send in. A long document you paste in costs more to process than a short question; a lengthy generated report costs more than a brief summary.

Context limits. Every model has a maximum number of tokens it can hold at once. More on this below.
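The cost point can be turned into simple arithmetic. In this sketch the prices (3 dollars per million input tokens, 15 dollars per million output tokens) are placeholders chosen for illustration, not any vendor's actual rates, and request_cost is a hypothetical helper.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Dollar cost of one request when input and output are priced separately."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical prices, for illustration only.
INPUT_PRICE = 3.0    # dollars per million input tokens (assumed)
OUTPUT_PRICE = 15.0  # dollars per million output tokens (assumed)

# A pasted 20,000-token document versus a 50-token question,
# each followed by a 1,000-token answer.
long_doc = request_cost(20_000, 1_000, INPUT_PRICE, OUTPUT_PRICE)
short_q = request_cost(50, 1_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"pasted document: ${long_doc:.4f}")
print(f"short question:  ${short_q:.4f}")
```

Under these assumed prices the pasted document costs roughly five times as much as the short question, even though the answer is the same length in both cases.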

What the Model Can See

Every conversation you have with an AI (your prompts, the model's responses, any documents you paste in) accumulates in something called a context window. Think of it as the model's working memory: everything inside it is what the model can "see" when formulating a response. Everything outside it doesn't exist.

Context windows are measured in tokens, and the ceiling varies by vendor and keeps rising. Claude's Opus and Sonnet models, for example, now offer a 1-million-token window (Anthropic, 2026a) — roughly 750,000 words, or about ten full novels. Maybe that sounds much bigger than you will need ... but tokens add up faster than you'd expect: a detailed prompt, a few rounds of back-and-forth, and a pasted document can consume tens of thousands of tokens before you've noticed.

This raises a fair question: in an ongoing chat, what exactly does the model re-read each time you hit enter? The answer is all of it. A language model has no memory between turns. Each time you send a message, the entire conversation so far (every earlier prompt, every earlier reply, and every document you've pasted) is bundled together and fed back through the model as fresh input. It doesn't recall the previous turn the way a person would; it re-reads the whole transcript from the top, every single time, and only then predicts its next token. This is why a long conversation gradually feels slower and costs more: the input keeps growing with every exchange.
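A short sketch shows how quickly that re-reading adds up. The turn sizes are invented (100-token prompts, 300-token replies), and input_per_turn is a hypothetical helper that just tracks how large the transcript is each time it is sent.

```python
def input_per_turn(turn_sizes):
    """Given (prompt_tokens, reply_tokens) for each turn, return the number
    of input tokens the model processes on each turn."""
    history = 0
    inputs = []
    for prompt_tokens, reply_tokens in turn_sizes:
        history += prompt_tokens   # the new prompt joins the transcript
        inputs.append(history)     # the whole transcript is sent as input
        history += reply_tokens    # the reply joins it for the next turn
    return inputs


# Five identical turns: a 100-token prompt and a 300-token reply (assumed sizes).
turns = [(100, 300)] * 5
per_turn = input_per_turn(turns)

print(per_turn)       # [100, 500, 900, 1300, 1700]
print(sum(per_turn))  # 4500
```

You typed only 500 tokens of prompts across the five turns, but the model processed 4,500 tokens of input, because every turn resends everything before it.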

When a conversation finally outgrows the context window, tools diverge in how they cope. Some simply drop the oldest turns to make room. Others quietly summarize earlier parts of the conversation and carry the shorter summary forward instead of the verbatim text. Still others use retrieval — keeping the history elsewhere and pulling back only the snippets relevant to your latest question. Each approach trades completeness against cost, and each explains why a tool can sometimes seem to "forget" something you said much earlier.
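The simplest of these strategies, dropping the oldest turns, can be sketched in a few lines. The function names, the 10-token budget, and the word-counting stand-in for a tokenizer are all assumptions made for illustration; actual tools do not publish their exact logic, and it may differ.

```python
def rough_count(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(turns, max_tokens, count_tokens=rough_count):
    """Keep the most recent turns whose combined size fits in max_tokens."""
    kept = []
    total = 0
    for turn in reversed(turns):          # walk backwards from the newest turn
        size = count_tokens(turn)
        if total + size > max_tokens:
            break                         # this turn and everything older is dropped
        kept.append(turn)
        total += size
    kept.reverse()                        # restore chronological order
    return kept


conversation = [
    "my name is Sam",
    "hello Sam",
    "what is a token",
    "a chunk of text",
]

print(fit_to_window(conversation, max_tokens=10))
# ['hello Sam', 'what is a token', 'a chunk of text']
```

The first message, where the user gave their name, no longer fits and is silently gone. From the model's side it was never said.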

Two things happen as you approach the limit. First, older content may get truncated (quietly dropped) to make room. Second, and less obviously, models don't attend to everything in a long context equally. Research has found that models tend to perform better on information near the beginning or end of a context, and noticeably worse on material buried in the middle (Liu et al., 2024). If you paste a 40-page report and the critical figure is on page 22, the model may handle it less accurately than if it appeared on the first page. This is sometimes called the lost-in-the-middle effect.

The practical upshot: more context isn't the same as better comprehension. Pasting everything you have doesn't guarantee the model reads it all with equal care. If something in a document matters, say so explicitly: point the model to it rather than assuming it found it on its own.

What This Means in Practice

Thinking in tokens helps you become a more effective AI user:

  • Be specific, not just long. More tokens don't mean better results. A well-structured 200-word prompt often outperforms a rambling 800-word one.

  • Point to what matters in long documents. Don't assume the model read everything equally — it didn't. If a specific section is important, say so.

  • Don't confuse fluency with accuracy. The model is optimizing for what text fits, not for what is true. Verification still matters.

For Government and Policy Readers

The following considerations apply to any institutional user, but they carry particular weight in government, First Nations governance, and regulated sectors: contexts where data handling, costs, and process integrity have formal consequences.

Not all languages tokenize equally. Tokenizers are built primarily on English text, which means English is processed efficiently. French, Indigenous languages, and specialized vocabularies all break into more tokens for the same amount of meaning — increasing cost, but more importantly, tending to produce less reliable outputs (Ahia et al., 2023). If your work involves multilingual content or Indigenous languages, performance will not be uniform, and this gap is not closed simply by choosing a better model.

Reasoning models multiply token costs. Some AI models generate internal "thinking" tokens (chains of reasoning steps) before producing a response. These "thinking" tokens aren't a different kind of cognition. They're produced by the same next-token mechanism described earlier, run on the problem before the answer: the model predicts a chain of intermediate tokens that work through the problem, then predicts a final answer from that expanded context. These tokens are largely invisible to the user but are billed at normal rates, sometimes driving total costs far higher than a standard model would.

It also helps to know that a provider rarely offers a single model. Most offer a ladder of them at different capability and price points. Anthropic's Claude line, for instance, now spans Fable (its most capable and most expensive tier), Opus (a strong second tier), Sonnet (a mid-tier balance), and Haiku (fastest and cheapest) (Anthropic, 2026b), and competing tools offer a similar tiered range.

Reasoning is not a separate model you bolt on: today's flagship models are themselves reasoning-capable, and what you adjust is how hard the model thinks before it answers, usually through an effort or "thinking" control in the app, or a parameter for developers. Turn it up and answers may improve, but the invisible thinking tokens, and the bill, climb with it. The capability you saw in a polished demo and the model quietly assigned to your day-to-day workload may therefore sit at very different points on the price curve. When evaluating or procuring AI tools, it is worth confirming whether the model being demonstrated is the same one that will be used (and billed) in production (Han et al., 2025).

Same input, different output. Token prediction involves controlled randomness: send the same prompt twice and you may receive meaningfully different responses. Even turning randomness off doesn't fully fix this; infrastructure-level quirks mean identical prompts can still produce different outputs from run to run (He & Thinking Machines Lab, 2025). For government work where consistency, traceability, or auditability is required, this matters. It does not make AI unusable, but it does mean that AI-assisted outputs cannot be treated as reproducible in the way that a formula or a database query is (Bommasani et al., 2021).
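The "controlled randomness" can be sketched as a weighted draw. The candidate tokens and their scores below are invented, and sample_next is a hypothetical helper; the temperature setting shown is a common way such randomness is exposed, though the details vary by tool. Note what the sketch leaves out: it is perfectly repeatable at temperature 0, whereas the infrastructure-level variation described above happens at a layer this toy does not model.

```python
import math
import random


def sample_next(scores, temperature, rng):
    """Pick one token from a {token: score} mapping.

    temperature == 0 always takes the top-scoring token; higher values
    spread the choice across lower-scoring tokens.
    """
    if temperature == 0:
        return max(scores, key=scores.get)
    # Softmax with temperature, shifted by the max for numerical stability.
    scaled = {tok: s / temperature for tok, s in scores.items()}
    top = max(scaled.values())
    weights = [math.exp(v - top) for v in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical scores for the next token after "The application was ...".
SCORES = {"approved": 2.0, "denied": 1.5, "pending": 0.5}

rng = random.Random()  # unseeded: results differ from run to run
print([sample_next(SCORES, 1.0, rng) for _ in range(5)])
print([sample_next(SCORES, 0, rng) for _ in range(5)])
```

The first line can change every time you run it; the second is always "approved" five times. In a document that matters, the difference between "approved" and "denied" is not a stylistic one.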

Vendor pricing comparisons are not straightforward. Different providers use different tokenizers, so "100,000 tokens" from one vendor is not the same amount of text as "100,000 tokens" from another. A procurement process that compares per-token prices without accounting for tokenization efficiency may reach the wrong conclusion. When comparing vendors, run the same sample of your actual content through each one and compare the resulting token counts, and the total cost they imply, directly.
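A worked example shows how the lower per-token price can still be the more expensive option. Both vendors, their prices, and their token counts are made up for illustration; the point is only the arithmetic.

```python
def cost_for_sample(token_count, price_per_million):
    """Dollar cost of processing one sample at a given per-million-token price."""
    return token_count * price_per_million / 1_000_000


# The same sample document, tokenized by two hypothetical vendors.
vendor_a = {"price_per_million": 2.00, "tokens_for_sample": 130_000}
vendor_b = {"price_per_million": 2.40, "tokens_for_sample": 100_000}

cost_a = cost_for_sample(vendor_a["tokens_for_sample"], vendor_a["price_per_million"])
cost_b = cost_for_sample(vendor_b["tokens_for_sample"], vendor_b["price_per_million"])

print(f"Vendor A: ${cost_a:.2f}")  # Vendor A: $0.26
print(f"Vendor B: ${cost_b:.2f}")  # Vendor B: $0.24
```

Vendor A advertises the cheaper token, but its tokenizer splits this particular content into more pieces, so the same document costs more to process there.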

Sidebar: Are tokens about to get cheaper?

Maybe, but cheaper to generate is not the same as cheaper to use. Tech journalists have noted that a new generation of Nvidia hardware (the "Blackwell" systems) produces tokens far more efficiently than the chips before it (Barr, 2026; SemiAnalysis figures). Some providers are already cutting prices.

The catch is a pattern economists call the Jevons paradox: when something gets cheaper, people use far more of it. Per-token prices have fallen roughly 98% since 2022, yet total AI bills have risen, because cheaper tokens invite heavier use: longer prompts, bigger documents, and reasoning models that burn thousands of invisible "thinking" tokens per answer. So the headline "token prices are about to plummet" seems true at the level of raw cost, but it is misleading when you look at the monthly bill. For anyone purchasing AI tools, the unit price is the wrong thing to watch. Total consumption is the number that moves your bill.
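The arithmetic behind that claim is short. The 98% figure comes from the paragraph above; the 200-fold usage increase is an invented example, and both function names are hypothetical.

```python
def breakeven_usage_multiple(price_drop_percent):
    """How many times more tokens you can consume before the bill
    returns to its original size."""
    return 100 / (100 - price_drop_percent)


def bill_multiple(price_drop_percent, usage_multiple):
    """New bill as a multiple of the old bill."""
    return usage_multiple * (100 - price_drop_percent) / 100


print(breakeven_usage_multiple(98))  # 50.0
print(bill_multiple(98, 200))        # 4.0
```

After a 98% price drop, using 50 times as many tokens leaves the bill unchanged. An organization whose usage grew 200-fold (an assumed figure) would be paying four times what it paid before, at a unit price that looks like a bargain.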

The Bigger Picture

The token is a small, unglamorous thing: part price tag, part ID number, part word fragment. But it's also the fundamental building block of everything that looks like AI intelligence today. When a language model creates a document, drafts an email, or answers a question, it is doing nothing more than predicting one token at a time.

References

Ahia, O., Ogueji, K., Stenetorp, P., Eisenstein, J., & Goldwater, S. (2023). Do all languages cost the same? Tokenization in the era of commercial language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 1904–1919. https://aclanthology.org/2023.emnlp-main.120

Anthropic. (2026a, March 13). 1M context is now generally available for Opus 4.6 and Sonnet 4.6. Claude Blog. https://claude.com/blog/1m-context-ga

Anthropic. (2026b, June 9). Claude Fable 5 and Claude Mythos 5. https://www.anthropic.com/news/claude-fable-5-mythos-5

Barr, A. (2026, June 12). Why AI token prices are about to plummet. Business Insider. https://www.businessinsider.com/ai-token-price-crash-nvidia-blackwell-gpus-2026-6

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … Liang, P. (2021). On the opportunities and risks of foundation models. arXiv. https://arxiv.org/abs/2108.07258

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://arxiv.org/abs/2005.14165

Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., & Chen, Z. (2025). Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025 (pp. 24842–24855). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.findings-acl.1274

He, H., & Thinking Machines Lab. (2025, September 10). Defeating nondeterminism in LLM inference. Thinking Machines Lab: Connectionism. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1715–1725. https://doi.org/10.18653/v1/P16-1162

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
