Short definition
Important note: The figures in this article reflect the state of the field as of June 2026 and may change.
An LLM token is the basic unit of text that a language model processes during both input encoding and output generation. Tokens do not map one-to-one to words: one token corresponds to approximately 0.75 English words on average, so 100 tokens equal roughly 75 words. Punctuation marks and subword fragments count as tokens of their own, while a leading space is usually folded into the token of the word that follows it.
Extended definition
Before any text reaches a language model, a tokenizer converts raw input into a sequence of integer IDs, each representing an LLM token. These IDs are what the model actually reads. On the output side, the model generates tokens one at a time, and the tokenizer converts them back into readable text. This two-step process is invisible to most application developers, but it directly shapes performance, cost, and capability boundaries.
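The round trip from text to integer IDs and back can be sketched with a deliberately simplified tokenizer. The example below assumes a toy scheme in which every UTF-8 byte is its own token; real tokenizers use learned subword vocabularies, and the function names encode and decode are illustrative only.

```python
def encode(text: str) -> list[int]:
    """Toy tokenizer: one token ID per UTF-8 byte (vocabulary size 256)."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Inverse of encode: turn the integer IDs back into readable text."""
    return bytes(ids).decode("utf-8")


ids = encode("Hi, LLM!")
print(ids)          # the integer sequence a model would consume
print(decode(ids))  # round-trips to the original string
```

A production tokenizer exposes the same two operations, only with a much larger vocabulary and far fewer IDs per sentence.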
Token counts matter because every model has a context window measured in tokens, not characters or words. A model with a 128,000-token context window can hold roughly 96,000 English words in a single conversation. Once that limit is reached, older content must be dropped, summarized, or offloaded to external memory. Misunderstanding this distinction causes real bugs: developers who estimate prompt size in words routinely exceed context limits in production.
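For rough capacity planning, the word-to-token rule of thumb can be written down as a small helper. This is a sketch that assumes about 0.75 English words per token; the constant and the function names are assumptions, and production code should count with the model's own tokenizer instead.

```python
import math

# Rule of thumb for English prose; an assumption, not a measurement.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the whitespace-separated word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_context(text: str, context_window: int = 128_000) -> bool:
    """Check whether the estimated token count fits the context window."""
    return estimate_tokens(text) <= context_window


print(estimate_tokens("one two three"))        # 3 words / 0.75 words per token
print(fits_in_context("word " * 96_000))       # right at the 128,000-token limit
```

Because the estimate is an average, it will undercount for code, dense terminology, and non-English text.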
Cost is also calculated in tokens. Commercial APIs from providers like OpenAI and Anthropic charge per token for both input and output, with prices usually quoted per million tokens. A long system prompt repeated across thousands of requests can become a significant operating cost. Optimizing token usage is therefore a concrete engineering concern, not just a theoretical one.
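The cost of a repeated system prompt is simple arithmetic, sketched below. The prompt length, request volume, and price are hypothetical figures chosen for illustration, not any provider's actual rates.

```python
def prompt_cost_usd(prompt_tokens: int, requests: int,
                    price_usd: float, per_tokens: int = 1_000_000) -> float:
    """Total input cost of sending the same prompt on every request.

    price_usd is the price for `per_tokens` input tokens (hypothetical here).
    """
    return prompt_tokens * requests * price_usd / per_tokens


# Hypothetical: a 1,500-token system prompt, 2 million requests,
# priced at 3.00 USD per million input tokens.
print(prompt_cost_usd(1_500, 2_000_000, 3.00))
```

Trimming the prompt by a third would cut this line item by the same fraction, which is why prompt length is worth profiling.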
Generation speed is reported in tokens per second, which is the correct unit for benchmarking inference throughput. A model producing 40 tokens per second generates roughly 30 words per second, a detail that matters for latency-sensitive applications like real-time chat or streaming completions.
Deep technical explanation
How tokenization works
Most modern language models use Byte Pair Encoding (BPE) or the Unigram algorithm, often through a library such as SentencePiece, which implements both. BPE starts with individual characters or bytes and iteratively merges the most frequent adjacent pair into a single token until a target vocabulary size is reached, commonly from about 32,000 to more than 200,000 tokens (as of 20 June 2026). The resulting vocabulary covers common English words as single tokens, less common words as two or three subword fragments, and rare strings as individual characters or bytes.
For example, the word ‘tokenization’ might be split into [‘token’, ‘ization’], while ‘cat’ is a single LLM token. Numbers are often split into single digits or short digit groups, which is one reason arithmetic is difficult for language models: ‘1024’ may be four separate tokens with no inherent numeric relationship between them.
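The merge loop at the heart of BPE can be sketched in a few lines. This is a minimal character-level toy trained on a six-word corpus, not the implementation used by any particular model; the function name train_bpe and the corpus are assumptions made for illustration.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a tiny corpus (toy BPE)."""
    # Each word starts as a tuple of single-character symbols, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = train_bpe(["low", "low", "lower", "lowest", "new", "newer"], num_merges=3)
print(merges)
```

Each learned merge adds one entry to the vocabulary, so the number of merges effectively sets the vocabulary size.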
Context limits and the num_ctx parameter
A model’s context window (num_ctx in frameworks like Ollama) defines the maximum number of tokens across both the input prompt and the generated output. If a prompt consumes 90,000 tokens in a model with a 128,000-token context, the model can generate at most 38,000 output tokens before hitting the hard ceiling. Some runtimes handle an oversized prompt by truncating the input, not the output, so the model silently loses earlier context rather than warning the caller; others reject the request with an error.
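The shared budget between prompt and output reduces to one subtraction, sketched here. The function name is illustrative, and raising an error on an oversized prompt is a design choice of this sketch, not the behavior of any specific runtime.

```python
def max_output_tokens(num_ctx: int, prompt_tokens: int) -> int:
    """Tokens left for generation once the prompt has been counted."""
    if prompt_tokens > num_ctx:
        # Fail loudly instead of letting a runtime truncate silently.
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, context window is {num_ctx}"
        )
    return num_ctx - prompt_tokens


print(max_output_tokens(128_000, 90_000))
```

Running this check before each request turns a silent truncation into an explicit, debuggable failure.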
Token distribution across content types
English prose is the most token-efficient content type, averaging around 1.3 tokens per word (about 0.75 words per token). Code is less efficient: indentation, brackets, and identifiers often split into many small tokens, so the same number of lines of Python may consume significantly more tokens than an equivalent block of prose. Structured formats like JSON and XML are especially token-heavy because every brace, bracket, and key-value separator is a separate token or subword fragment.
Edge cases and failure modes
Non-English languages are often under-represented in the data used to build tokenizer vocabularies, causing the tokenizer to fall back to short subword or byte-level tokens. A single Chinese or Arabic word may require four to six tokens where a common English word needs only one. This increases cost and reduces effective context for multilingual applications. Developers who build multilingual LLM features without accounting for this routinely overspend on inference and hit context limits far sooner than expected.
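The effect of byte-level fallback can be illustrated with a toy model. The sketch assumes a hypothetical one-entry vocabulary and a worst case in which every UTF-8 byte of an unknown word becomes its own token; real tokenizers have far larger vocabularies and usually do better than this.

```python
# Hypothetical vocabulary that only knows one English word.
TOY_VOCAB = {"cat"}


def fallback_token_count(word: str, vocab: set[str] = TOY_VOCAB) -> int:
    """One token if the word is in the vocabulary, else one token per UTF-8 byte."""
    if word in vocab:
        return 1
    return len(word.encode("utf-8"))


for language, word in {"English": "cat", "Arabic": "قطة", "Chinese": "猫"}.items():
    print(language, word, fallback_token_count(word))
```

The same concept costs one token in the covered language and several in the others, which is the mechanism behind the higher per-request cost described above.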
Another common failure mode involves injecting dynamic content, such as retrieved documents or user inputs, into a fixed-length prompt template without token-length validation. If the retrieved document is long, the combined prompt silently exceeds the context window and older, sometimes critical, instructions are dropped before the model processes the request.
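Token-length validation of a filled-in template can be sketched as follows. The template, the budget, and the whitespace-based counter are placeholders; in practice count_tokens would wrap the tokenizer that matches the target model.

```python
from typing import Callable

# Hypothetical prompt template with two dynamic slots.
TEMPLATE = "Answer using only the document.\n{document}\nQuestion: {question}"


def whitespace_count(text: str) -> int:
    """Stand-in token counter; replace with the model's real tokenizer."""
    return len(text.split())


def build_prompt(document: str, question: str,
                 count_tokens: Callable[[str], int],
                 max_prompt_tokens: int) -> str:
    """Fill the template and refuse to return a prompt that is over budget."""
    prompt = TEMPLATE.format(document=document, question=question)
    used = count_tokens(prompt)
    if used > max_prompt_tokens:
        raise ValueError(f"prompt uses {used} tokens, budget is {max_prompt_tokens}")
    return prompt


print(build_prompt("Tokens are units of text.", "What is a token?",
                   whitespace_count, max_prompt_tokens=50))
```

Rejecting the request at build time lets the caller shorten or summarize the document deliberately instead of losing instructions to truncation.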
Practical examples
Scenario 1: A team builds a document summarization tool and estimates prompt size in words. In testing, prompts stay under the stated word limit, but in production, legal documents with dense terminology cause context overflows. Adding a token counter before each API call, using the model’s own tokenizer library, eliminates the issue.
Scenario 2: A company deploying a multilingual customer support bot notices that Arabic queries cost three times more per request than English ones. After profiling token usage by language, the team restructures system prompts to be shorter and moves verbose instructions to a retrieval layer, reducing per-request cost by 40 percent.
Scenario 3: An engineering team benchmarks two hosted models for a streaming code assistant. Model A returns 60 tokens per second and Model B returns 25. For code completions averaging 80 tokens, Model A delivers results in about 1.3 seconds, while Model B takes about 3.2 seconds and exceeds the acceptable latency threshold. Token-per-second benchmarking drives the infrastructure decision.
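The comparison in Scenario 3 follows from dividing completion length by throughput, as in this sketch. It ignores time to first token and network overhead, and the two-second threshold is an assumed value for illustration.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a completion at a steady generation rate."""
    return output_tokens / tokens_per_second


LATENCY_THRESHOLD_S = 2.0  # assumed acceptable latency, not from a real SLA

for name, rate in {"Model A": 60, "Model B": 25}.items():
    seconds = generation_seconds(80, rate)
    verdict = "ok" if seconds <= LATENCY_THRESHOLD_S else "too slow"
    print(f"{name}: {seconds:.2f} s ({verdict})")
```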
Scenario 4: A RAG (retrieval-augmented generation) pipeline retrieves five documents per query and injects them into the prompt. With no token budget enforcement, longer documents crowd out the user’s question and the model’s own instructions. Setting a hard token budget per retrieved chunk, truncating at the tokenizer level, stabilizes answer quality.
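A per-chunk token budget like the one in Scenario 4 can be sketched over already tokenized chunks. The sketch assumes each chunk is a list of token IDs produced by the model's tokenizer; the function name and both budget values are illustrative.

```python
def pack_chunks(chunks: list[list[int]], per_chunk_budget: int,
                total_budget: int) -> list[list[int]]:
    """Truncate each chunk to its budget and stop before the total is exceeded."""
    packed, used = [], 0
    for token_ids in chunks:
        kept = token_ids[:per_chunk_budget]  # truncate at the token level
        if used + len(kept) > total_budget:
            break  # leave room for the question and the instructions
        packed.append(kept)
        used += len(kept)
    return packed


# Three retrieved chunks of 10, 3, and 8 tokens (dummy IDs).
chunks = [list(range(10)), list(range(3)), list(range(8))]
print(pack_chunks(chunks, per_chunk_budget=5, total_budget=12))
```

Truncating on token IDs rather than characters guarantees the budget holds exactly, whatever the language or content type of the retrieved text.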
Why it matters
- Context window limits are enforced in tokens, so any system that estimates size in words or characters will produce unpredictable behavior in production.
- API costs from commercial LLM providers are billed per token, making token efficiency a direct line item in infrastructure budgets.
- Generation throughput is measured in tokens per second, which is the correct unit for latency SLA planning in real-time AI features.
- Multilingual applications consume tokens at significantly different rates by language, requiring language-aware token budgeting to avoid cost overruns.
- Prompt engineering and retrieval-augmented generation pipelines both require explicit token accounting to prevent context truncation and degraded output quality.
- Subword tokenization explains why language models struggle with arithmetic, character counting, and rare vocabulary, guiding correct application design decisions.