
LLM Fundamentals — Tokens, Attention & Transformers (2026)

Every GenAI engineering skill you will learn — prompt engineering, RAG, agents, evaluation — sits on top of one foundation: understanding how large language models actually work. Not the math. The mental models. This page gives you the intuition that makes everything else click.

Who this is for:

  • Software engineers switching into GenAI: You can code. You need the conceptual foundation to reason about LLMs, not a PhD-level paper walkthrough.
  • Junior GenAI engineers: You’ve used ChatGPT and maybe built a basic RAG app. You want to understand why your system behaves the way it does.
  • Senior engineers in interviews: You need to explain transformers, attention, and tokenization clearly and concisely under pressure.

A large language model is a next-token prediction machine. That is the single most important sentence in this guide. Everything a model does — answering questions, writing code, translating languages, summarizing documents — is the result of predicting the next token in a sequence, over and over.

This sounds reductive. It is not. The transformer architecture that powers modern LLMs achieves extraordinary capability from this simple objective. But understanding the objective helps you reason about the system’s behavior:

  • Why does a model sometimes “hallucinate” facts? Because it predicts plausible-sounding next tokens, not verified facts.
  • Why does a longer prompt cost more? Because the model processes every token in the input.
  • Why does rephrasing a prompt change the output? Because different tokens activate different patterns learned during training.
  • Why do models struggle with counting letters in “strawberry”? Because tokenization splits words into subword pieces, and the model never sees individual characters.

You do not need linear algebra to understand these dynamics. You need the right mental models.


The 2025-2026 period brought three shifts that directly change how you architect LLM systems: larger context windows, reasoning models, and competitive open-weight alternatives.

In 2023, 4K tokens was standard. GPT-4 offered 32K as a premium tier. By early 2026, the landscape looks radically different:

Model               Context Window   Approx. Pages of Text
GPT-4o              128K tokens      ~300 pages
Claude 3.5 Sonnet   200K tokens      ~500 pages
Gemini 2.0 Pro      2M tokens        ~5,000 pages
Llama 3.1 405B      128K tokens      ~300 pages

This changes architecture decisions. Many use cases that required RAG in 2023 can now be solved by stuffing the entire document into the context window. But context window size is not free — cost scales linearly with input length, and the model’s ability to recall information can degrade in very long contexts.

OpenAI’s o1 and o3, Anthropic’s Claude extended thinking, and Google’s Gemini thinking mode introduced a new paradigm: models that “think” before responding. These models use chain-of-thought reasoning internally, spending more compute (and more tokens) to improve accuracy on hard problems.

For engineers, this means: the same model can operate in two modes. Fast mode for simple tasks. Slow, expensive mode for complex reasoning. Your architecture needs to account for both.

Llama 3.1, Mistral Large, and DeepSeek V3 narrowed the gap with proprietary models. For many production use cases — classification, extraction, summarization — a 70B open-weight model running on your infrastructure matches GPT-4-class performance at a fraction of the cost. Fine-tuning an open-weight model on your domain data often outperforms a general-purpose proprietary model.


Four concepts — what an LLM is, how tokenization works, what embeddings represent, and how attention operates — form the complete mental model for reasoning about LLM behavior.

Forget the Wikipedia definition. Here is the engineer’s mental model:

An LLM is a compressed representation of patterns in text. During training, the model reads trillions of tokens — books, code, websites, conversations — and learns statistical patterns: which tokens tend to follow which other tokens, in what contexts, with what relationships. The result is a function that takes a sequence of tokens as input and outputs a probability distribution over the next token.

That function has billions of parameters (weights). GPT-4 is estimated at ~1.8 trillion parameters across its mixture-of-experts architecture. Anthropic and Google do not disclose parameter counts for Claude 3.5 Sonnet or Gemini, but they are believed to be in the hundreds-of-billions range.

The key insight: the model does not “know” things the way a database knows things. It has learned patterns. When you ask “What is the capital of France?”, the model does not look up France in a table. It has seen the pattern “capital of France” followed by “Paris” so many times during training that “Paris” has an extremely high probability as the next token.

Before a model can process your text, it must convert it into tokens. A token is a unit of text — sometimes a word, sometimes part of a word, sometimes a single character or punctuation mark.

How tokenization works:

Most modern models use Byte Pair Encoding (BPE) or SentencePiece. The algorithm:

  1. Start with individual characters as the base vocabulary
  2. Find the most frequently co-occurring pair of tokens in the training data
  3. Merge that pair into a single new token
  4. Repeat until the vocabulary reaches a target size (typically 32K-100K tokens)
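The merge loop above can be sketched in a few lines of Python. This is a toy illustration on a four-word corpus, not a production tokenizer (real implementations handle byte-level vocabularies, pre-tokenization, and frequency ties carefully):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Each word starts as a tuple of characters. Returns the merges learned
    (in order) and the corpus rewritten with the merged symbols.
    """
    corpus = [tuple(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        rewritten = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            rewritten.append(tuple(out))
        corpus = rewritten
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "lowest", "low"], num_merges=2)
```

After two merges, "l"+"o" and then "lo"+"w" are fused, so "low" becomes a single token while rarer suffixes like "er" and "est" remain split — the same dynamic that makes common words one token and rare words several.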

Why this matters for engineers:

  • “Strawberry” = 3 tokens. The tokenizer splits it into str + aw + berry. The model never sees the individual letters s-t-r-a-w-b-e-r-r-y. This is why counting letters is hard for LLMs.
  • Common words = 1 token. “the”, “and”, “function” are single tokens. Your prompts are cheaper and faster when they use common vocabulary.
  • Code is expensive. Variable names, syntax, and whitespace in code often tokenize into many small tokens. A 100-line Python file might be 500+ tokens.
  • Non-English text costs more. Most tokenizers are trained primarily on English text. Japanese, Arabic, and Hindi text can use 2-4x more tokens per word than English.

Token counts determine cost and speed. Every API call is billed by input tokens plus output tokens. A GPT-4o call with 10,000 input tokens and 2,000 output tokens costs several times more than one with 1,000 input tokens and 500 output tokens. Understanding tokenization is understanding your cost model.

Once text is tokenized, each token is converted into a dense vector — a list of numbers that represents the token’s meaning in a high-dimensional space. This vector is called an embedding.

GPT-4’s embeddings are estimated at 12,288 dimensions. Each token becomes a list of 12,288 numbers. The key property: tokens with similar meanings end up close together in this space. “king” and “queen” are close. “king” and “bicycle” are far apart.
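The “close together” property is usually measured with cosine similarity. A minimal sketch using made-up 3-dimensional vectors (real embeddings have thousands of dimensions, and the numbers here are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (a real model learns these during training).
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.1]
bicycle = [0.1, 0.2, 0.9]

assert cosine_similarity(king, queen) > cosine_similarity(king, bicycle)
```

This same measure is what RAG systems use to rank document chunks against a query embedding.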

Positional encodings are added to tell the model where each token sits in the sequence. Without this, the model would see “dog bites man” and “man bites dog” as identical — the same tokens, just rearranged.

Embeddings are the bridge between human-readable text and the mathematical operations the model performs. You will encounter embeddings again when you study RAG — they are how retrieval systems find relevant documents.

The attention mechanism is what makes transformers work. Here is the intuition without the math.

When processing the word “it” in the sentence “The cat sat on the mat because it was tired”, the model needs to figure out that “it” refers to “the cat”, not “the mat”. Attention solves this.

How attention works (conceptual):

For every token in the sequence, attention asks: “How relevant is every other token to understanding this token?” It computes a relevance score between every pair of tokens, then uses those scores to create a weighted combination of information from all tokens.

Think of it as a lookup. Each token creates three things:

  • A query: “What am I looking for?”
  • A key: “What do I contain?”
  • A value: “What information should I contribute?”

The query of one token is compared against the keys of all other tokens. High-scoring matches contribute more of their value. Low-scoring matches contribute less.
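That query/key/value lookup can be written out in plain Python. This is a toy single-head, scaled dot-product attention over hand-picked 2-dimensional vectors; real models learn the Q, K, and V projections and use thousands of dimensions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Relevance of every token to this query, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted combination of all value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three tokens with toy 2-dim query/key/value vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = attention(Q, K, V)
```

Each output row is a blend of all three value vectors, weighted by how well that token’s query matched the others’ keys — exactly the “weighted combination” described above.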

Multi-head attention runs this process multiple times in parallel (GPT-3 used 96 attention heads; GPT-4 is estimated to use a similar or larger count). Each head learns to attend to different types of relationships — syntactic, semantic, positional. One head might learn subject-verb agreement. Another might learn coreference resolution. Another might learn that a closing parenthesis should match an opening one.

Why engineers should care:

  • Attention is why context window size matters. The model computes attention between every pair of tokens. With N tokens, that is N-squared comparisons. Doubling the context window quadruples the compute for attention.
  • Attention is why prompt structure matters for prompt engineering. Placing relevant information near the query (rather than buried in a long context) helps the model attend to it more effectively.
  • Attention is why models can “forget” information in very long contexts. With 100K+ tokens, attention scores spread thin. Information in the middle of a long document gets less attention than information at the beginning or end.

A transformer processes text through a sequence of layers. Each layer transforms the representation of every token, gradually building up from surface-level patterns (syntax, word identity) to deep patterns (reasoning, world knowledge).

Raw text enters the model as a string. The tokenizer converts it into a sequence of integer IDs (one per token). Each ID maps to an embedding vector. Positional encodings are added. The result: a matrix where each row is a token’s initial representation.

Layer 2: The Transformer Block (Repeated N Times)


The core of the model is a transformer block, repeated many times. GPT-4 is estimated to have 120 layers. Each block contains:

  1. Multi-head self-attention: Every token attends to every other token. Outputs are a refined representation of each token that incorporates information from the full context.
  2. Feed-forward network (FFN): A per-token transformation. Two linear layers with a nonlinearity in between. This is where researchers believe factual knowledge is stored — the FFN layers act as a compressed knowledge base.
  3. Layer normalization and residual connections: Technical details that stabilize training and allow information to flow through deep networks.

After 120+ rounds of attention and feed-forward processing, each token’s representation encodes rich contextual information about its meaning, its relationships, and its role in the sequence.

The final token’s representation is projected onto the vocabulary — a vector of ~100K numbers, one per possible token. A softmax function converts these into probabilities. The token with the highest probability is the model’s prediction for the next token.

The model generates text one token at a time. It takes the full input sequence, predicts the next token, appends that token to the sequence, and repeats. This is called autoregressive generation.

A 500-token response requires 500 sequential forward passes through the model. (In practice a key-value cache avoids recomputing attention for earlier tokens, but each new token still requires its own decode step.) This is why output tokens are more expensive than input tokens in most API pricing — input tokens can be processed in parallel in a single pass, while each output token requires another round of inference.
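The generation loop itself is simple. In this sketch, `predict_next` is a stand-in for the model’s forward pass (a real LLM would return a probability distribution over the vocabulary at each step):

```python
def predict_next(tokens):
    """Stand-in for a model forward pass: returns the next token.

    A real LLM would compute a distribution over ~100K tokens; here we
    fake it with a canned continuation purely for illustration.
    """
    canned = ["The", "capital", "of", "France", "is", "Paris", "."]
    return canned[len(tokens)] if len(tokens) < len(canned) else "<eos>"

def generate(prompt_tokens, max_new_tokens=10, eos="<eos>"):
    """Autoregressive generation: predict, append, repeat until done."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)  # one full inference per output token
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens

out = generate(["The", "capital", "of"])
```

Note that every iteration re-reads the full sequence so far — which is why long outputs are slow and why each output token is billed as a separate inference.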


The diagram below traces text through each stage of the transformer pipeline, from raw input to next-token probability distribution.

How a Transformer Processes Text

Each layer transforms the input representation. Understanding these layers gives you intuition for prompt engineering, fine-tuning, and debugging.

  • Input Text: Raw user prompt or document — the starting point
  • Tokenizer: Splits text into subword tokens (BPE, SentencePiece). GPT-4: ~100K vocabulary
  • Token Embeddings: Each token becomes a dense vector (768-12,288 dims) plus positional encoding
  • Attention Layers: Self-attention computes relationships between all tokens. The core innovation.
  • Feed-Forward Network: Per-token transformation. Where factual knowledge is stored.
  • Output Probabilities: Softmax over vocabulary → next token prediction. Temperature controls distribution shape.

These four examples translate LLM concepts directly into engineering decisions you will encounter in production: cost estimation, architecture selection, temperature tuning, and hallucination mitigation.

Example 1: Why Tokenization Affects Your Cost


You are building a customer support chatbot. Your system prompt is 800 tokens. Each user message averages 50 tokens. Each response averages 300 tokens. With GPT-4o pricing at $2.50/M input tokens and $10/M output tokens:

  • Per conversation turn: (800 + 50) x $2.50/M + 300 x $10/M = $0.002 + $0.003 = $0.005
  • 10-turn conversation: System prompt is resent every turn. Input tokens grow: 800 + (50 x 10) = 1,300 input tokens on the last turn. Total cost: ~$0.06

Optimization: use a shorter system prompt. Reducing it from 800 to 400 tokens saves 40% on input costs across all turns. Use a tokenizer tool (like OpenAI’s tiktoken or Anthropic’s token counter) to measure before and after.
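The arithmetic above generalizes to a small helper. The prices are the GPT-4o figures quoted in this example ($2.50/M input, $10/M output), which will drift over time, and the conversation model is simplified (prior assistant responses are ignored in the resent history):

```python
def turn_cost(input_tokens, output_tokens,
              in_price_per_m=2.50, out_price_per_m=10.0):
    """Dollar cost of one API call at per-million-token prices."""
    return (input_tokens * in_price_per_m / 1e6
            + output_tokens * out_price_per_m / 1e6)

def conversation_cost(system_tokens, user_tokens_per_turn,
                      output_tokens_per_turn, turns):
    """Total cost when the system prompt and all prior user messages
    are resent on every turn (assistant replies ignored for simplicity)."""
    total = 0.0
    for t in range(1, turns + 1):
        input_tokens = system_tokens + user_tokens_per_turn * t
        total += turn_cost(input_tokens, output_tokens_per_turn)
    return total

single = turn_cost(850, 300)                       # one turn: ~$0.005
ten_turns = conversation_cost(800, 50, 300, 10)    # ten turns: ~$0.06
```

Running the same calculation with a 400-token system prompt makes the savings from prompt trimming concrete before you touch production traffic.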

Example 2: Context Window Architecture Decisions


You need to build a Q&A system over 500 internal documents, each averaging 2,000 tokens. Total corpus: 1M tokens.

Option A — Stuff the context: With Gemini 2.0 Pro’s 2M token window, you could literally pass all documents in a single prompt. Cost: ~$1.25 per query at Gemini’s input pricing. Latency: 30-60 seconds. Accuracy on retrieval: the model processes everything, but attention degrades for information in the middle.

Option B — RAG: Embed all documents. Retrieve the top 5 most relevant chunks (~2,000 tokens). Pass only those to the model. Cost: ~$0.01 per query. Latency: 2-5 seconds. Accuracy: depends on retrieval quality, but focused context means stronger attention.

For most production systems, Option B wins. But for ad-hoc analysis of a small corpus, Option A is simpler.

Example 3: Temperature Controls Output Variability


Temperature scales the probability distribution before sampling. Here is what different temperatures do in practice:

Temperature   Behavior                                                   Use Case
0.0           Greedy — always picks the highest-probability token        Classification, extraction, structured output
0.3-0.5       Low variability — sticks close to the most likely tokens   Customer support, factual Q&A
0.7-0.9       Moderate variability — creative but coherent               Content generation, brainstorming
1.0+          High variability — surprising, sometimes incoherent        Creative writing, exploration

Rule of thumb: If the task has one correct answer (extraction, classification), use temperature 0. If the task benefits from variety (writing, ideation), use 0.7. Almost no production system uses temperature above 1.0.
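Temperature scaling is one line of math: divide each logit by the temperature before applying softmax. A sketch with invented logits for three candidate tokens:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index after temperature-scaling the logits.

    Temperature 0 is greedy argmax; 1.0 leaves the distribution
    unchanged; higher values flatten it toward uniform.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the scaled distribution.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
greedy = sample_with_temperature(logits, 0)   # always index 0
```

At temperature 0 the same index is returned every time; at 0.7 the second and third tokens get a real (if small) chance of being chosen.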

Example 4: Why Models Hallucinate

A model trained on text from 2023 is asked: “Who won the 2025 Super Bowl?” The model has no training data about this event. But it has learned the pattern: “Who won the [year] Super Bowl?” is typically followed by a team name. So it predicts a plausible-sounding team name with high confidence.

This is hallucination. The model is doing exactly what it was trained to do — predict the most likely next tokens. It has no mechanism to say “I don’t know” unless that pattern was reinforced during training (which RLHF partially addresses, but not perfectly).

Practical implications:

  • Never trust model output for factual claims without verification
  • Use RAG to ground responses in retrieved documents
  • Use evaluation pipelines to measure hallucination rates
  • Set system prompts that instruct the model to say “I don’t know” when uncertain

Every LLM architecture decision reduces to trade-offs across size, cost, latency, quality, and hosting — the framework below makes those trade-offs explicit.

Every model selection involves trade-offs across these dimensions:

Dimension            Small Model (7-13B)   Medium Model (30-70B)   Frontier Model (GPT-4o, Claude 3.5)
Latency              <500ms                1-3s                    2-10s
Cost per 1M tokens   $0.10-0.30            $0.50-2.00              $2.50-15.00
Reasoning ability    Basic                 Good for domain tasks   Strong general reasoning
Hosting              Single GPU            2-4 GPUs                API only (mostly)
Fine-tuning          Easy, cheap           Moderate                Limited or expensive

Use a small model (Llama 3.1 8B, Mistral 7B) when:

  • The task is narrow and well-defined (classification, extraction, routing)
  • Latency budget is <1 second
  • You have training data to fine-tune for your domain
  • Volume is high enough that API costs would be prohibitive

Use a frontier model (GPT-4o, Claude 3.5 Sonnet) when:

  • The task requires general reasoning across domains
  • You need strong instruction-following without fine-tuning
  • Quality matters more than cost
  • You are prototyping and need fast iteration

Use a reasoning model (o3, Claude extended thinking) when:

  • The task involves multi-step logic, math, or code generation
  • Accuracy on hard problems matters more than latency
  • You can afford 10-30 second response times

Scenario                     Best Approach       Why
Corpus <50K tokens, static   Stuff the context   Simpler, no retrieval infrastructure
Corpus 50K-500K tokens       RAG                 Better cost, attention quality
Corpus >500K tokens          RAG + reranking     Must retrieve; full context too expensive
Corpus changes frequently    RAG                 Embedding updates cheaper than re-prompting
Need precise citation        RAG                 Can track which chunks contributed
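The decision table above can be captured as a small helper. The thresholds mirror the table and are rough heuristics, not universal rules:

```python
def choose_architecture(corpus_tokens, changes_frequently=False,
                        needs_citations=False):
    """Rough context-vs-RAG decision, mirroring the scenario table."""
    if changes_frequently or needs_citations:
        return "rag"                    # freshness or citation tracking
    if corpus_tokens < 50_000:
        return "stuff-context"          # simpler, no retrieval infra
    if corpus_tokens <= 500_000:
        return "rag"                    # better cost and attention quality
    return "rag-with-reranking"         # full context too expensive

assert choose_architecture(30_000) == "stuff-context"
assert choose_architecture(30_000, needs_citations=True) == "rag"
assert choose_architecture(1_000_000) == "rag-with-reranking"
```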

These questions are drawn from real GenAI engineering interviews; the strong answers below reflect what distinguishes candidates who understand the architecture from those who have only used the APIs.

Q: What is a token? How does tokenization work?

Strong answer: “A token is a subword unit. The tokenizer uses BPE to build a vocabulary of common subword sequences. Common English words are single tokens. Rare words and code get split into multiple tokens. The vocabulary size is typically 32K-100K. Tokenization affects cost because API pricing is per-token, and it affects model behavior because the model reasons at the token level, not the character level.”

Q: What is the difference between temperature 0 and temperature 1?

Strong answer: “Temperature scales the logit distribution before sampling. At temperature 0, the model always picks the highest-probability token — it is deterministic. At temperature 1, the original distribution is used — lower-probability tokens have a real chance of being selected. Higher temperature means more randomness. I use temperature 0 for extraction and classification, 0.7 for content generation.”

Q: Why do LLMs hallucinate?

Strong answer: “LLMs are trained to predict the next most likely token, not to verify factual accuracy. When asked about something outside their training data or something rare in their training data, they generate plausible-sounding tokens based on pattern matching. The model has no built-in mechanism to distinguish ‘I know this’ from ‘this sounds right.’ RLHF reduces but does not eliminate hallucination.”

Q: How would you design a system that uses both a small model and a large model?

Strong answer: “I would use a model router pattern. A small, fast model (8B parameters, <200ms latency) handles classification: is this query simple or complex? Simple queries — FAQ lookups, classification, extraction — go to the small model. Complex queries — multi-step reasoning, novel analysis, ambiguous intent — route to a frontier model. This reduces cost by 60-80% while maintaining quality on hard queries. The router itself can be fine-tuned on your traffic distribution.”
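A minimal version of that router pattern, with a keyword heuristic standing in for the small classifier model (the function names and signals here are illustrative, not a real API):

```python
def classify_complexity(query):
    """Placeholder for the small router model: crude keyword heuristic.

    In production this would be a call to a fast fine-tuned classifier,
    not substring matching.
    """
    hard_signals = ("why", "compare", "design", "step by step", "prove")
    return "complex" if any(s in query.lower() for s in hard_signals) else "simple"

def route(query):
    """Send simple queries to a cheap model, complex ones to a frontier model."""
    if classify_complexity(query) == "complex":
        return ("frontier-model", query)   # e.g. GPT-4o / Claude 3.5 Sonnet
    return ("small-model", query)          # e.g. an 8B open-weight model

model, _ = route("What are your opening hours?")
```

The interesting engineering work is in the classifier: it should be trained on your actual traffic distribution, and its mistakes should fail safe (route to the frontier model when unsure).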

Q: Explain the attention mechanism to a product manager.

Strong answer: “When the model reads a sentence, each word looks at every other word to understand context. ‘It’ in ‘The server crashed because it ran out of memory’ needs to know that ‘it’ refers to ‘the server.’ Attention computes these connections automatically. Longer inputs mean more connections to compute, which is why longer prompts are slower and more expensive. And very long contexts can dilute attention — important information in the middle of a 100-page document might get less focus than information at the start or end.”

Q: What are the failure modes of large context windows?

Strong answer: “Three main failure modes. First, the ‘lost in the middle’ problem — models attend more strongly to the beginning and end of long contexts, so information in the middle gets less attention. Second, cost scales linearly with context length, so a 200K token context at GPT-4o pricing is $0.50 per query — that adds up fast. Third, latency increases with context length. For production systems serving thousands of queries, stuffing 200K tokens per request is usually not viable. RAG with targeted retrieval is more practical.”


Running LLMs in production requires active management of token budgets, latency, and the infrastructure decision between managed APIs and self-hosted models.

Every production LLM application needs a token budget strategy. Here is a practical framework:

  1. Measure your baseline. Use a tokenizer library to count tokens in your system prompt, average user input, and average model output. Track this as a metric.
  2. Set per-request limits. Cap max_tokens in your API calls. A customer support bot should not generate 4,000-token responses. Set it to 500 and tune from there.
  3. Monitor token usage. Track input and output tokens per request. Alert when averages shift — a prompt change that adds 200 tokens to every request costs real money at scale.
  4. Compress context. Summarize conversation history instead of passing the full transcript. A 20-turn conversation with full history can exceed 10K tokens. A summary fits in 500.
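Steps 2 and 4 can be combined into a small context builder that estimates token usage and compresses old history once a budget is exceeded. The 4-characters-per-token estimate and the `summarize` hook are placeholders; use a real tokenizer and a cheap model call in production:

```python
def count_tokens(text):
    """Crude token estimate: ~4 characters per token for English text.

    A real system should use the provider's tokenizer; this is a sketch.
    """
    return max(1, len(text) // 4)

def build_context(system_prompt, history, budget_tokens=2000, summarize=None):
    """Return messages that fit the budget, summarizing old history if needed."""
    used = count_tokens(system_prompt) + sum(count_tokens(m) for m in history)
    if used <= budget_tokens:
        return [system_prompt] + history
    # Over budget: replace all but the last two turns with a summary.
    old, recent = history[:-2], history[-2:]
    summary = summarize(old) if summarize else "Summary of earlier conversation."
    return [system_prompt, summary] + recent

ctx = build_context("You are a support bot.", ["hi", "hello, how can I help?"])
```

A 20-turn transcript that would cost 10K tokens per request collapses to a few hundred tokens of summary plus the most recent turns.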

Technique                                 Impact                                  Complexity
Streaming responses                       Perceived latency drops 50-80%          Low
Shorter prompts                           Linear reduction in input processing    Low
Model downgrade (GPT-4o mini vs GPT-4o)   2-5x faster                             Low
Caching common queries                    Near-zero latency for cache hits        Medium
Prompt caching (Anthropic, OpenAI)        50-90% reduction on repeated prefixes   Low
Self-hosted model with batching           Full control over throughput            High
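Of these, caching common queries is often the quickest win. A sketch using `functools.lru_cache` with a stand-in `call_llm`; a production cache would normalize prompts and live in shared storage such as Redis:

```python
import functools

CALLS = {"count": 0}

def call_llm(prompt):
    """Stand-in for a real API call; counts invocations for illustration."""
    CALLS["count"] += 1
    return f"answer to: {prompt}"

@functools.lru_cache(maxsize=1024)
def cached_answer(prompt):
    """Identical prompts are served in-process; only misses hit the model."""
    return call_llm(prompt)

cached_answer("What are your opening hours?")
cached_answer("What are your opening hours?")  # cache hit, no second API call
```

Exact-match caching only helps when users repeat the same queries; for paraphrases, embedding-based semantic caching is the usual next step.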

Use an API (OpenAI, Anthropic, Google) when:

  • You want zero infrastructure overhead
  • Your volume is <100K requests/day
  • You need frontier-model quality
  • Data privacy requirements allow third-party processing

Self-host (vLLM, TGI, Ollama) when:

  • Data cannot leave your infrastructure (healthcare, finance, government)
  • Volume exceeds 100K requests/day and cost matters
  • You need to fine-tune a model on proprietary data
  • Latency requirements demand co-located inference

Provider    Key Models                         Strengths
OpenAI      GPT-4o, o3, GPT-4o mini            Broad capability, strong tool use, reasoning (o3)
Anthropic   Claude 3.5 Sonnet, Claude 3 Opus   Long context (200K), strong instruction following, safety
Google      Gemini 2.0 Pro, Gemini 2.0 Flash   Massive context (2M), multimodal, competitive pricing
Meta        Llama 3.1 (8B, 70B, 405B)          Open-weight, fine-tunable, self-hostable
Mistral     Mistral Large, Mixtral             European provider, strong multilingual, MoE efficiency
DeepSeek    DeepSeek V3                        Open-weight, strong reasoning, very competitive benchmarks

No single model dominates all tasks. Production systems increasingly use multiple models — a fast model for routing, a mid-tier model for common tasks, and a frontier model for hard queries.


With LLM fundamentals in place, the logical next steps are prompt engineering, RAG, agents, and evaluation — each builds directly on what you have learned here.

  1. LLMs are next-token predictors. Every behavior — good and bad — follows from this objective.
  2. Tokenization is the first bottleneck. It determines cost, speed, and the model’s ability to reason about text.
  3. Attention is the core innovation. It lets every token in the sequence consider every other token.
  4. Context windows are not free. Bigger is not always better — cost, latency, and attention quality all degrade with length.
  5. Temperature controls the randomness-reliability trade-off. Low for accuracy, high for creativity.
  6. Hallucination is structural, not a bug. The model predicts likely tokens, not verified facts. Design your system accordingly.
  7. Model selection is a multi-dimensional trade-off. Size, cost, latency, quality, and hosting constraints all matter.

With this foundation, every other topic on the site will make more sense:

  • Prompt Engineering — Now that you understand tokens and attention, learn how to structure inputs for better outputs.
  • RAG Architecture — Understand how retrieval augmented generation solves the knowledge limitation of LLMs.
  • AI Agents — Learn how LLMs go from answering questions to executing multi-step tasks.
  • LLM Evaluation — Build the testing infrastructure that catches hallucinations and quality regressions.
  • Fine-Tuning vs RAG — Learn when and how to adapt a model to your specific domain.
  • Python for GenAI — Set up your development environment for building LLM applications.

Frequently Asked Questions

How do large language models work?

A large language model is a next-token prediction machine. It takes a sequence of tokens as input and predicts the most probable next token, repeating this process to generate complete responses. The transformer architecture powers this through self-attention, which lets each token consider its relationship to every other token.

What is tokenization in LLMs?

Tokenization splits text into subword pieces (tokens) that the model processes. Words are split into common subword units — 'programming' might become 'program' and 'ming'. This is why models struggle with tasks like counting letters in 'strawberry' — they never see individual characters, only subword tokens. Learn more in the tokenization guide.

What is attention in transformers?

Attention is the mechanism that lets each token consider its relationship to every other token in the sequence. When processing the word 'bank' in a sentence about finance versus rivers, attention weights shift to disambiguate the meaning. Self-attention scales quadratically with sequence length, which is why context windows have cost and size limits.

Why do LLMs hallucinate facts?

LLMs hallucinate because they predict plausible-sounding next tokens, not verified facts. The model has no mechanism to distinguish between generating a true statement and a false one — both are just probable token sequences. The training objective is to produce text that statistically resembles the training data, not to be factually accurate.

What are embeddings in LLMs?

Embeddings are dense vectors — lists of numbers — that represent a token's meaning in a high-dimensional space. GPT-4's embeddings are estimated at 12,288 dimensions. Tokens with similar meanings end up close together in this space. Learn more in the embeddings guide.

What is the difference between temperature 0 and temperature 1 in LLMs?

Temperature scales the probability distribution before sampling the next token. At temperature 0, the model always picks the highest-probability token (deterministic, greedy). At temperature 1, the original distribution is used, giving lower-probability tokens a chance. Use temperature 0 for extraction and classification; use 0.7 for content generation.

How do context windows affect LLM cost and performance?

Context windows determine how many tokens a model can process in a single request. Larger windows allow processing more text, but cost scales linearly and attention quality can degrade. The 'lost in the middle' problem means models attend more strongly to the beginning and end of long inputs. See the context windows guide for details.

When should I use RAG instead of a large context window?

For corpora under 50K tokens, stuffing the context window is simpler. For corpora above 50K tokens, RAG with targeted retrieval is cheaper, faster, and often more accurate. RAG also wins when your corpus changes frequently or when you need precise citation tracking.

How do I choose between a small model and a frontier model?

Use a small model (7-13B parameters) when the task is narrow, latency must be under 1 second, you have fine-tuning data, or volume makes API costs prohibitive. Use a frontier model (GPT-4o, Claude 3.5 Sonnet) when the task requires general reasoning or strong instruction-following without fine-tuning.

What is the transformer pipeline from input text to output?

The transformer pipeline has three stages. First, raw text is split into tokens and converted into embedding vectors with positional encoding. Second, representations pass through repeated transformer blocks containing multi-head self-attention and feed-forward networks. Finally, the last token's representation is projected onto the vocabulary to produce a next-token probability distribution.