Tokenization Guide — BPE, SentencePiece & Token Counting (2026)

Every LLM skill you build — prompt engineering, RAG, embeddings, and cost optimization — relies on a single mechanism you probably never think about: tokenization. It is the invisible layer between your text and the model. Understanding it eliminates surprises in production.

Who this is for:

  • Software engineers new to GenAI: You know how to call the API but you want to understand why token counts vary, why code prompts cost more than prose, and why some languages inflate your bill.
  • GenAI engineers optimizing costs: You are hitting context window limits or want to predict API spend accurately before deploying a feature.
  • Engineers preparing for interviews: Tokenization appears in senior GenAI engineering interviews more often than candidates expect — especially the BPE algorithm and multilingual token count implications.

Most engineers treat tokenization as a black box: “text goes in, tokens come out.” That is fine for a hello-world demo. In production, the black box bites you.

Every major LLM API — OpenAI, Anthropic, Google, Cohere — charges per token, not per character or per word. When you write a prompt, you are spending money proportional to its token count. When the model responds, you pay for output tokens too. A system prompt you copy-paste from a tutorial might be 800 tokens. Multiply that by 500,000 daily requests and you are paying for 400 million extra tokens per day that you did not budget for.
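The arithmetic is worth making concrete. A back-of-envelope sketch, with the per-million-token price as an illustrative assumption (check your provider's current pricing page before budgeting):

```python
# Back-of-envelope cost model for a static system prompt sent on every request.
# The price below is an illustrative assumption, not a quoted rate.

def daily_prompt_cost(prompt_tokens: int, daily_requests: int,
                      price_per_million_input: float) -> float:
    """Daily cost of resending the same prompt prefix on every request."""
    tokens_per_day = prompt_tokens * daily_requests
    return tokens_per_day / 1_000_000 * price_per_million_input

# An 800-token system prompt at 500,000 requests/day, assuming
# $2.50 per million input tokens:
cost = daily_prompt_cost(800, 500_000, 2.50)
print(f"{800 * 500_000:,} tokens/day -> ${cost:,.2f}/day")  # 400,000,000 tokens/day -> $1,000.00/day
```

Trimming that prompt is one of the highest-leverage cost changes available, because the saving multiplies across every request.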

Knowing tokenization lets you estimate cost before deploying. It lets you identify which parts of a prompt are expensive and which are free. It is a first-class engineering skill, not an academic detail.

The famous 128K context window in GPT-4o or the 200K context window in Claude 3.5 Sonnet — those numbers are in tokens. Whether you can fit a document in a single API call depends on its token count, not its word count or file size. An engineer who does not count tokens cannot reliably architect systems that use long context. They discover the limit in production, during an incident.

LLMs are next-token prediction machines — a concept covered in detail in LLM Fundamentals. The model never sees “strawberry” — it sees the token IDs that represent “st”, “raw”, “berry” or some other split depending on the tokenizer. This has consequences:

  • Spelling and letter-counting tasks are harder than they look because the model reasons over tokens, not characters
  • Arithmetic on numbers is non-trivial because “12345” might be one token or multiple tokens depending on its position in training data frequency
  • Some code patterns are more expensive than others because whitespace and special characters tokenize differently than English words

These are not edge cases. They are the everyday texture of working with LLMs.


2. Why Tokenization Matters for Engineers (Context Windows and Multilingual)


Token counts diverge significantly from word counts for code, JSON, and non-English languages — differences that break cost models and context window calculations in production.

A rough rule of thumb: one English word ≈ 1.3 tokens. But this breaks down quickly:

Content type | Approximate tokens per word
Common English words | 1 token
Technical English (rare words) | 1.5–2 tokens
Python or JavaScript code | 1.5–3 tokens
JSON or YAML | 2–4 tokens
Spanish / French / German | 1.2–1.5 tokens
Japanese / Chinese / Korean | 2–4 tokens
Arabic / Hindi / Thai | 3–6 tokens

If you build a multilingual RAG system or a code assistant, the English-centric token counting assumption will break your cost models and context window calculations.

Tokenization is also the first step when generating embeddings. The same text goes through the same tokenizer before being transformed into a vector. This means token count limits apply to embedding models too — most embedding models have a maximum of 512 or 8192 tokens. Chunking strategies for RAG must account for token limits, not character limits. A chunk that is “500 words” in English may be 600 tokens. In Japanese it may be 1,400 tokens.
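Token-aware chunking can be sketched as a greedy packer over a pluggable token counter. `chunk_by_tokens` and the whitespace-split stand-in counter below are illustrative; in production you would pass the real model tokenizer's encode function:

```python
# Token-budget chunking sketch. `count_tokens` is pluggable: in production
# pass the real tokenizer (e.g. lambda t: len(enc.encode(t)) for tiktoken);
# here a whitespace split stands in so the example is self-contained.

def chunk_by_tokens(sentences, max_tokens, count_tokens):
    """Greedily pack sentences into chunks that stay under max_tokens.
    A single sentence larger than the budget still gets its own chunk."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks

fake_count = lambda text: len(text.split())  # stand-in tokenizer
sentences = ["one two three", "four five", "six seven eight nine", "ten"]
print(chunk_by_tokens(sentences, max_tokens=5, count_tokens=fake_count))
# ['one two three four five', 'six seven eight nine ten']
```

Swapping in the real tokenizer is the whole point: the same sentences produce different chunk boundaries under tiktoken than under a whitespace count, especially for non-English text.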


3. How Byte Pair Encoding (BPE) Works

Byte Pair Encoding is the algorithm behind OpenAI’s tiktoken, Anthropic’s Claude tokenizer, and Meta’s Llama tokenizer. It is the dominant algorithm in modern LLMs.

BPE starts with the insight that you need a fixed-size vocabulary that can represent any text — including words and phrases the model has never seen. You cannot enumerate every word in every language. But you can represent any text if you fall back to individual bytes (characters). The trick is to do this efficiently, so common sequences use single tokens.

Step 1 — Initialize with character vocabulary. Start with a vocabulary of individual characters (or bytes). Every character is one token. “hello world” starts as: h, e, l, l, o, (space), w, o, r, l, d.

Step 2 — Count pair frequencies. Count how often each adjacent pair of tokens appears across the entire training corpus. In a typical English corpus, e + r might appear 800,000 times. t + h might appear 2 million times.

Step 3 — Merge the most frequent pair. Create a new token for the most frequent pair. t + h → th (token ID 256, for example). Replace every occurrence of t h in the corpus with the single token th.

Step 4 — Repeat. Count pair frequencies again. Merge again. Repeat until you reach your target vocabulary size (typically 32,000 to 100,000 tokens).

The result: Common English words become single tokens (“hello” → one token ID). Uncommon words are represented as sequences of subword tokens (“tokenization” → “token” + “ization” or “token” + “iz” + “ation” depending on training corpus frequency). Rare or invented words fall back to character-level tokens.

BPE achieves an elegant tradeoff. Common sequences (frequently used words, common programming keywords) get single tokens — efficient to represent and cheap to process. Rare sequences are split into familiar subword pieces — still representable, just more tokens. No word is ever “unknown” to the model because it can always fall back to byte-level representation.
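The training loop above is small enough to sketch in full. This toy implementation is for intuition only; real tokenizers (tiktoken, SentencePiece) add byte-level fallback, pre-tokenization, and heavy optimization on top of the same core loop:

```python
# Toy BPE trainer: the four steps from the text, in ~20 lines.
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    tokens = list(corpus)                            # Step 1: character vocabulary
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))     # Step 2: count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # Step 3: most frequent pair
        merges.append(best)
        merged, i = [], 0                            # replace the pair everywhere
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged                              # Step 4: repeat on merged corpus
    return merges

merges = train_bpe("the theme then the thin the", num_merges=3)
print(merges)  # first merge is ('t', 'h'), then ('th', 'e')
```

On this tiny corpus the first merge fuses t + h into th and the second fuses th + e into the, exactly the bottom-up construction described above.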

The vocabulary size is a tunable hyperparameter. GPT-2 used 50,257 tokens. GPT-4 uses ~100,000. Larger vocabularies encode common text more efficiently (fewer tokens per sentence) but require more memory and are harder to train.


4. BPE Merge Process — Visual Explanation


The diagram below shows how BPE builds a vocabulary bottom-up — starting from individual characters, iteratively merging the most frequent adjacent pair until the target vocabulary size is reached.

BPE Tokenization — From Characters to Vocabulary

BPE builds a vocabulary bottom-up by iteratively merging the most frequent adjacent pair. Common sequences become single tokens; rare sequences fall back to subwords or characters.

[Diagram: four stages. 1. Initialize: character vocabulary (h e l l o / w o r l d; each char = 1 token). 2. Count pairs: corpus-wide frequency (lo → 12,400x; he → 18,200x; th → 2,100,000x). 3. Merge top pair: t + h → th (new token ID 256), replace in corpus. 4. Repeat until target vocabulary size reached (the → token 312; hello → token 4,521; 32K–100K total tokens).]

When you send text to GPT-4o or Claude, the tokenizer applies the learned merge rules in order. The text is first split into pre-tokenization units (roughly by whitespace and punctuation), then BPE merges are applied to each unit. The output is a sequence of integer token IDs that the model’s embedding layer converts to vectors.

This process is deterministic and fast — a modern tokenizer processes tens of millions of characters per second.


5. Beyond BPE: WordPiece and SentencePiece

BPE is not the only approach. Two other algorithms are widely used.

WordPiece is used by BERT, DistilBERT, and Google’s early transformer models. It is conceptually similar to BPE but uses a different criterion for merging pairs.

How it differs from BPE: BPE merges the most frequently occurring pair. WordPiece merges the pair that maximizes the likelihood of the training data under the current language model. This is a probabilistic criterion rather than a pure frequency criterion. In practice, the resulting vocabularies are similar, but WordPiece tends to preserve more semantic coherence in subword splits.

Subword prefix convention: WordPiece marks continuation subwords with a ## prefix. The word “tokenization” might split as: token, ##ization. The ## signals that this subword continues from the previous token without a space.

When you will encounter it: If you use sentence-transformers or BERT-based embedding models for RAG retrieval, those models use WordPiece tokenization. Token count limits (typically 512 tokens for BERT-style models) apply to WordPiece tokens, which may differ from the GPT-4 tiktoken count for the same text.
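The ## convention is easy to work with in code. A small helper (illustrative, not part of any library) that reverses a WordPiece-style split back into words:

```python
# Rejoin WordPiece-style subwords: '##' marks a continuation that attaches
# to the previous token without a space. The example tokens are illustrative,
# not the output of a real BERT tokenizer.

def join_wordpieces(pieces):
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]      # glue continuation onto previous word
        else:
            words.append(piece)
    return " ".join(words)

print(join_wordpieces(["token", "##ization", "is", "fun"]))  # tokenization is fun
```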

SentencePiece, developed at Google and used in models like T5, Gemma, and LLaMA 2 (via its BPE variant), takes a different approach to the pre-tokenization problem.

The key difference: Standard BPE and WordPiece assume that text is pre-split on whitespace. This works well for English but breaks down for Japanese, Chinese, Thai, and other languages that do not use spaces between words. SentencePiece treats the raw input — including spaces — as just another character. It operates directly on the raw Unicode stream without any language-specific pre-processing.

Result: SentencePiece is language-agnostic. It handles English, Japanese, Arabic, and code with the same algorithm. This is why it is the tokenizer of choice for multilingual models.

Unigram variant: SentencePiece can use either BPE or a “Unigram Language Model” algorithm. The Unigram approach works differently — it starts with a large candidate vocabulary and iteratively removes tokens that contribute least to the language model likelihood, rather than building up from characters. Unigram tends to produce more consistent tokenization of rare words.

Algorithm | Used by | Key characteristic
BPE | GPT-4, Claude, Llama 3 | Merges most frequent pairs; byte-level fallback
WordPiece | BERT, DistilBERT, sentence-transformers | Merges by language model likelihood; ## prefix notation
SentencePiece (BPE) | T5, Gemma, Llama 2, Mistral | Language-agnostic; no whitespace pre-tokenization
SentencePiece (Unigram) | ALBERT, mBART | Starts large, prunes to target vocabulary size

6. Counting Tokens in Practice

Counting tokens before sending to an API is a production discipline, not an optional step. You need it to avoid context window overflow, predict costs, and implement smart chunking for RAG.

Counting tokens for OpenAI models (tiktoken)


OpenAI’s tiktoken library is the standard way to count tokens for GPT models.

import tiktoken

# Load the encoding for a specific model
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Tokenization is the invisible layer between your text and the model."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")          # Token count: 13
print(f"Token IDs: {tokens}")                 # [51603, 2065, ...]
print(f"Decoded back: {enc.decode(tokens)}")  # original text

# Count tokens in a chat messages list (includes message overhead)
def count_tokens_for_messages(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # every message has <|start|>{role}\n{content}<|end|>
    tokens_per_name = 1
    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += len(enc.encode(value))
            if key == "name":
                total += tokens_per_name
    total += 3  # reply is primed with <|start|>assistant<|message|>
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain tokenization in two sentences."},
]
print(f"Messages token count: {count_tokens_for_messages(messages)}")

Counting tokens for Claude (Anthropic SDK)


The Anthropic Python SDK provides a count_tokens method that calls the API to count tokens accurately for Claude models.

import anthropic

client = anthropic.Anthropic()

# Count tokens for a message list
response = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain tokenization in two sentences."}
    ],
)
print(f"Input tokens: {response.input_tokens}")

Counting tokens for open-source models (Hugging Face transformers)


For Llama, Mistral, Gemma, and other open-source models, use the tokenizer from the Hugging Face transformers library.

from transformers import AutoTokenizer

# Load tokenizer for any Hugging Face model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Tokenization is the invisible layer between your text and the model."
tokens = tokenizer.encode(text)
print(f"Token count: {len(tokens)}")  # Llama 3 may differ from tiktoken

# Tokenize a chat with the correct chat template applied
messages = [
    {"role": "user", "content": "Explain tokenization in two sentences."}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
token_count = len(tokenizer.encode(formatted))
print(f"Chat template token count: {token_count}")

Important: Different models tokenize the same text differently. A prompt that is 800 tokens in tiktoken (GPT-4o) might be 840 tokens in the Llama 3 tokenizer and 780 tokens in a SentencePiece tokenizer. If you switch models, recalibrate your token count estimates.

OpenAI’s Tokenizer tool (platform.openai.com/tokenizer) lets you paste text and see exactly how it is split into tokens with color coding. This is invaluable for debugging unexpected token counts and understanding why certain prompts are more expensive than you expect.


7. Production Gotchas

These are the tokenization behaviors that regularly surprise engineers in production.

English is the language most heavily represented in LLM training data. The tokenizer vocabulary is optimized for English — common English words are single tokens. For other languages, the same information requires significantly more tokens.

A sentence in English might be 15 tokens. The same meaning expressed in Japanese might be 35 tokens. In Arabic: 40–50 tokens. In Thai (no word boundaries): 50+ tokens. If you build a multilingual application and budget based on English token counts, your non-English traffic will cost 2–4x more than expected. This has caused real production budget overruns.

Mitigation: Test your actual language distribution. Count tokens for representative samples of each language. Set per-language budgets. For RAG, chunk non-English documents using a smaller token count target to account for higher token density.

Code is significantly more token-dense than English prose. Whitespace (tabs, indentation) is tokenized. Symbols like {, }, ->, =>, :: may each be separate tokens. Variable names that are not common English words (like customerOnboardingStatus or MAX_RETRY_BACKOFF_MS) split into many subword tokens.

A 100-line Python function is typically 400–800 tokens depending on indentation level, variable naming conventions, and comment density. For code-heavy prompts (like passing a codebase context to an agent), this adds up fast. The context windows page covers strategies for managing long code contexts.

Every model adds special tokens that are not part of the visible text but are counted toward the context window. These include:

  • System prompt delimiters — tokens that mark the start and end of the system message
  • Chat turn separators — tokens between user and assistant turns
  • BOS / EOS tokens — beginning-of-sequence and end-of-sequence markers added by the tokenizer
  • Tool call tokens — structured tokens for function calling schemas

For a conversation with a 200-token system prompt and 3 turns, the actual token count may be 240–280 tokens after special tokens are added. The count_tokens_for_messages helper in the tiktoken example above accounts for this per-message overhead. Not accounting for special tokens is a common source of unexpected context_length_exceeded errors in multi-turn conversations.

Numbers tokenize inconsistently. Small integers (“1”, “2”, “100”) are often single tokens. Large numbers like “123456789” may split as “123”, “456”, “789” — three tokens. Decimal numbers, phone numbers, dates, and currency amounts all tokenize differently depending on their frequency in the training corpus.

This is a primary reason LLMs struggle with arithmetic. The model is not adding up numbers — it is predicting the next token in a sequence that includes token fragments of numbers. Always use code execution (via tool calling) for arithmetic-critical applications rather than relying on the model to compute.
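Delegating arithmetic to code can be as simple as an expression evaluator exposed as a tool. A minimal safe evaluator using Python's ast module (a sketch; the function name and the set of supported operators are choices for this example, not a standard API):

```python
# Evaluate arithmetic in code instead of asking the model to predict digits.
# Accepts only numeric literals and + - * / ** and unary minus; anything
# else (names, calls, attribute access) raises, so it is safe to expose
# as a tool-calling backend.
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return ev(ast.parse(expr, mode="eval").body)

print(safe_eval("123456789 + 987654321"))  # 1111111110
```

The model decides *what* to compute; the tool computes it exactly, regardless of how the digits tokenize.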

Some LLM providers (Anthropic, OpenAI) offer prompt caching that dramatically reduces cost for repeated prefixes. Claude’s prompt caching charges roughly 10% of the normal input price for cached tokens (verified: Anthropic docs, 2026). But caching operates at token boundaries — the cache hit requires an exact token-level match for the prefix. Understanding tokenization helps you structure prompts to maximize cache hits: put the stable, repeated portion (system prompt, document context) at the beginning and the variable portion (user query) at the end.


8. Cost Optimization Techniques

Tokenization knowledge is directly applicable to reducing API costs. These are engineering techniques, not tricks.

Start by counting tokens for every component of your prompt: system message, retrieved context chunks, conversation history, user query, and output. Build a token breakdown dashboard for your top query types. You will almost always find that one component dominates, and it is rarely the user query.

System prompts accumulate over time as engineers add instructions. A system prompt that started at 150 tokens may grow to 900 tokens over six months as the team adds edge case handling, format instructions, and behavioral constraints. Audit system prompts regularly. Use bullet points instead of paragraphs. Remove redundant instructions. Reduce verbose examples to shorter ones. A 30% system prompt reduction on 1 million daily requests saves 270 million tokens per day.

Multi-turn applications that keep full conversation history in the context window grow cumulative token consumption quadratically with conversation length, because every request resends the entire history. Strategies:

  • Rolling window: Keep only the last N turns in context
  • Summarization: When history exceeds a threshold, summarize earlier turns into a compressed summary and use that instead of the raw messages
  • Selective retention: Keep high-information turns (user queries, tool results) and drop low-information turns (filler, acknowledgements)
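The rolling-window strategy is a few lines. A sketch (the message format mirrors the role/content dicts used in the counting examples earlier; summarization and selective retention layer on top of the same shape):

```python
# Rolling-window history pruning: keep the system message plus the last
# N conversational messages. Illustrative sketch, not a library API.

def prune_history(messages, max_turns):
    """Keep system messages (if any) and the last `max_turns` non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-max_turns:]

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "turn 1"},
    {"role": "assistant", "content": "reply 1"},
    {"role": "user", "content": "turn 2"},
    {"role": "assistant", "content": "reply 2"},
    {"role": "user", "content": "turn 3"},
]
pruned = prune_history(history, max_turns=3)
print([m["content"] for m in pruned])
# ['You are a helpful assistant.', 'turn 2', 'reply 2', 'turn 3']
```

In practice the threshold should be a token budget rather than a turn count, using the model-specific tokenizer to measure each message.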

For RAG applications, chunk size is a token count decision, not a character count decision. Larger chunks mean fewer retrieved chunks but higher token cost per query. Smaller chunks mean more targeted retrieval but potentially more chunks needed to cover the answer. The optimal chunk size depends on your query type distribution, which you can only discover empirically with an evaluation set.

JSON responses from LLMs cost tokens for every brace, quote, key name, and whitespace character. If your downstream code is the only consumer of the structured output, use compact JSON (no extra whitespace) and abbreviated key names. A field named "customerBillingAddress" costs more tokens than "addr". This matters at scale.
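The difference is easy to measure. A quick comparison using Python's json module; the field names are illustrative, and exact token savings depend on the tokenizer:

```python
# Pretty JSON with verbose keys vs compact JSON with short keys.
# Fewer characters generally means fewer tokens (the exact ratio
# depends on the tokenizer, so measure with the real one).
import json

record = {"customerBillingAddress": "221B Baker Street", "customerId": 42}
compact_record = {"addr": "221B Baker Street", "id": 42}

pretty = json.dumps(record, indent=2)
compact = json.dumps(compact_record, separators=(",", ":"))
print(len(pretty), len(compact))  # compact is substantially shorter
```

The `separators=(",", ":")` argument strips the default spaces after commas and colons; combined with abbreviated keys, the structured output carries the same data in far fewer tokens.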

Different models charge different rates per token. GPT-4o-mini is more than an order of magnitude cheaper per token than GPT-4o (check current pricing pages for the exact ratio). For queries where a smaller model has sufficient capability — simple classification, extraction, formatting — routing to the cheaper model saves significant cost. This requires a routing classifier (usually a fast, cheap model) that evaluates query complexity and directs traffic appropriately.


9. Interview Preparation

Tokenization appears in GenAI engineering interviews more frequently than candidates expect — particularly for senior roles where system design and cost optimization are in scope.

“Walk me through how BPE tokenization works.”

Describe the bottom-up construction process: start with a character vocabulary, count pair frequencies across the training corpus, merge the most frequent pair into a new token, repeat until target vocabulary size is reached. Emphasize the two benefits: common sequences get efficient single-token representation, rare sequences fall back to subwords or characters — so no word is ever unknown. Mention that the merge rules are fixed at training time and applied deterministically at inference.

“Why might the same text produce different token counts with different models?”

Each model family trains its own tokenizer on its own corpus, resulting in different vocabulary content and different BPE merge rules. GPT-4 uses the cl100k_base encoding; Llama 3 uses a SentencePiece BPE tokenizer trained on a different corpus. The same sentence may tokenize into 13 tokens on one and 15 on the other. For engineers: always use the model-specific tokenizer for counting, not a generic approximation.

“How does tokenization affect context window consumption in a RAG system?”

Token counting must account for all components: system prompt, retrieved chunks, conversation history, user query, and format instructions. Non-English content and code are more token-dense than English prose, so retrieval strategies must use smaller chunk token budgets for those content types. Special tokens added by the tokenizer for chat templates add overhead beyond the visible text. In practice, a RAG system should count tokens programmatically before every API call and implement graceful degradation (fewer retrieved chunks, shorter history) when approaching the limit.

“How would you reduce LLM API costs in a production system?”

A strong answer covers multiple layers: (1) profile token counts by component to find the dominant cost driver, (2) compress system prompts and prune conversation history, (3) optimize RAG chunk size to retrieve fewer but more targeted chunks, (4) leverage prompt caching for stable prefixes (system prompt, document context), (5) route simpler queries to cheaper models, (6) measure output token usage and add max_tokens constraints to prevent unbounded generation.


10. Key Takeaways

Tokenization is the foundation layer for every cost, context window, and behavioral question you will encounter working with LLMs in production.

BPE is the dominant algorithm. It builds a vocabulary bottom-up by merging frequent character pairs, resulting in efficient encoding of common text and graceful fallback for rare words. GPT-4, Claude, and Llama 3 all use BPE variants. WordPiece (BERT) and SentencePiece (multilingual models) are the main alternatives.

Token counts are not word counts. English prose is roughly 1.3 tokens per word. Code is 1.5–3x. Japanese, Chinese, and Korean are 2–4x; Arabic, Hindi, and Thai are 3–6x. Non-English and code-heavy applications must account for this in cost models and context window planning.

Count tokens programmatically. Use tiktoken for OpenAI models, anthropic.messages.count_tokens() for Claude, and Hugging Face tokenizers for open-source models. Never estimate token counts from character or word counts in production.

Tokenization is an optimization lever. System prompt compression, conversation history pruning, RAG chunk size calibration, prompt caching alignment, and model routing are all tokenization-aware cost optimization strategies that apply at scale.


  • LLM Fundamentals — How transformers process token sequences into responses
  • Context Windows — Strategies for managing long contexts and avoiding overflow
  • Embeddings — How tokens are converted into vectors for semantic search
  • Prompt Engineering — Token-efficient prompt design for production systems

Last updated: March 2026. Token pricing and vocabulary sizes are verified against OpenAI, Anthropic, and Hugging Face documentation as of March 2026.

Frequently Asked Questions

What is tokenization in LLMs?

Tokenization is the process of converting text into numerical tokens that LLMs can process. LLMs don't see characters or words — they see token IDs from a fixed vocabulary. A token might be a whole word, a subword, or even a single character. The tokenizer determines how text is split and mapped to IDs, which directly affects model performance, cost, and context window usage.

What is BPE tokenization?

Byte Pair Encoding (BPE) is the most common tokenization algorithm used by GPT, Claude, and Llama models. It starts with individual characters, then iteratively merges the most frequent pair of adjacent tokens into a new token until reaching a target vocabulary size of typically 32K-100K tokens. BPE balances vocabulary size with encoding efficiency — common words become single tokens while rare words are split into subwords.

How do I count tokens before sending to an LLM?

Use the tiktoken library for OpenAI models, anthropic.messages.count_tokens() for Claude, and the Hugging Face transformers tokenizer for open-source models like Llama and Mistral. Always count tokens programmatically before API calls to avoid context window overflow and predict costs accurately.

Why does tokenization affect LLM costs?

LLM APIs charge per token — both input and output. The same text produces different token counts depending on the tokenizer. Code typically uses more tokens than English prose, and non-English languages often use 2-3x more tokens than English for the same content. Understanding tokenization helps you optimize prompts to reduce token count and estimate costs accurately.

What is the difference between BPE, WordPiece, and SentencePiece?

BPE merges the most frequently occurring adjacent pair and is used by GPT-4, Claude, and Llama 3. WordPiece merges pairs that maximize training data likelihood and is used by BERT and sentence-transformers. SentencePiece treats raw input including spaces as characters, making it language-agnostic and the tokenizer of choice for multilingual models like T5, Gemma, and Llama 2.

Why do non-English languages cost more tokens than English?

English is the most heavily represented language in LLM training data, so the tokenizer vocabulary is optimized for English — common English words become single tokens. For other languages, the same meaning requires significantly more tokens: Japanese may use 2-3x more, Arabic 3-4x, and Thai 4-6x. This directly inflates API costs and context window consumption for multilingual applications.

How does code tokenization differ from natural language?

Code is significantly more token-dense than English prose. Whitespace like tabs and indentation is tokenized separately, and symbols such as curly braces, arrows, and double colons may each be individual tokens. Variable names that are not common English words split into many subword tokens. A 100-line Python function typically produces 400-800 tokens.

What are special tokens and how do they affect context windows?

Special tokens are invisible tokens added by the tokenizer that count toward the context window but are not part of the visible text. These include system prompt delimiters, chat turn separators, beginning-of-sequence and end-of-sequence markers, and tool call tokens. For a conversation with a 200-token system prompt and 3 turns, the actual count may be 240-280 tokens after special tokens are added.

How does prompt caching relate to tokenization?

Prompt caching from providers like Anthropic and OpenAI reduces cost for repeated prefixes at roughly 10% of the normal input price. Caching operates at token boundaries, requiring an exact token-level match for the prefix. Understanding tokenization helps you structure prompts to maximize cache hits by placing the stable, repeated portion at the beginning and the variable user query at the end.

How can I optimize system prompts to reduce token costs?

System prompts accumulate over time and can grow from 150 to 900 tokens as teams add edge case handling and format instructions. Audit system prompts regularly, use bullet points instead of paragraphs, remove redundant instructions, and reduce verbose examples. A 30% system prompt reduction on 1 million daily requests saves 270 million tokens per day. See LLM Cost Optimization for more strategies.