
LLM Fundamentals — Tokens, Attention & Transformers (2026)

Every GenAI engineering skill you will learn — prompt engineering, RAG, agents, evaluation — sits on top of one foundation: understanding how large language models actually work. Not the math. The mental models. This page gives you the intuition that makes everything else click.

Who this is for:

  • Software engineers switching into GenAI: You can code. You need the conceptual foundation to reason about LLMs, not a PhD-level paper walkthrough.
  • Junior GenAI engineers: You’ve used ChatGPT and maybe built a basic RAG app. You want to understand why your system behaves the way it does.
  • Senior engineers in interviews: You need to explain transformers, attention, and tokenization clearly and concisely under pressure.

A large language model is a next-token prediction machine. That is the single most important sentence in this guide. Everything a model does — answering questions, writing code, translating languages, summarizing documents — is the result of predicting the next token in a sequence, over and over.

This sounds reductive. It is not. The transformer architecture that powers modern LLMs achieves extraordinary capability from this simple objective. But understanding the objective helps you reason about the system’s behavior:

  • Why does a model sometimes “hallucinate” facts? Because it predicts plausible-sounding next tokens, not verified facts.
  • Why does a longer prompt cost more? Because the model processes every token in the input.
  • Why does rephrasing a prompt change the output? Because different tokens activate different patterns learned during training.
  • Why do models struggle with counting letters in “strawberry”? Because tokenization splits words into subword pieces, and the model never sees individual characters.

You do not need linear algebra to understand these dynamics. You need the right mental models.


The 2025-2026 period brought three shifts that directly change how you architect LLM systems: larger context windows, reasoning models, and competitive open-weight alternatives.

In 2023, 4K tokens was standard. GPT-4 offered 32K as a premium tier. By early 2026, the landscape looks radically different:

Model               Context Window   Approx. Pages of Text
GPT-4o              128K tokens      ~300 pages
Claude 3.5 Sonnet   200K tokens      ~500 pages
Gemini 2.0 Pro      2M tokens        ~5,000 pages
Llama 3.1 405B      128K tokens      ~300 pages

This changes architecture decisions. Many use cases that required RAG in 2023 can now be solved by stuffing the entire document into the context window. But context window size is not free — cost scales linearly with input length, and the model’s ability to recall information can degrade in very long contexts.

OpenAI’s o1 and o3, Anthropic’s Claude extended thinking, and Google’s Gemini thinking mode introduced a new paradigm: models that “think” before responding. These models use chain-of-thought reasoning internally, spending more compute (and more tokens) to improve accuracy on hard problems.

For engineers, this means: the same model can operate in two modes. Fast mode for simple tasks. Slow, expensive mode for complex reasoning. Your architecture needs to account for both.

Llama 3.1, Mistral Large, and DeepSeek V3 narrowed the gap with proprietary models. For many production use cases — classification, extraction, summarization — a 70B open-weight model running on your infrastructure matches GPT-4-class performance at a fraction of the cost. Fine-tuning an open-weight model on your domain data often outperforms a general-purpose proprietary model.


Four concepts — what an LLM is, how tokenization works, what embeddings represent, and how attention operates — form the complete mental model for reasoning about LLM behavior.

Forget the Wikipedia definition. Here is the engineer’s mental model:

An LLM is a compressed representation of patterns in text. During training, the model reads trillions of tokens — books, code, websites, conversations — and learns statistical patterns: which tokens tend to follow which other tokens, in what contexts, with what relationships. The result is a function that takes a sequence of tokens as input and outputs a probability distribution over the next token.

That function has billions of parameters (weights). GPT-4 is estimated at ~1.8 trillion parameters across its mixture-of-experts architecture. Anthropic and Google do not disclose parameter counts for Claude 3.5 Sonnet or Gemini, but they are believed to be in the hundreds-of-billions range.

The key insight: the model does not “know” things the way a database knows things. It has learned patterns. When you ask “What is the capital of France?”, the model does not look up France in a table. It has seen the pattern “capital of France” followed by “Paris” so many times during training that “Paris” has an extremely high probability as the next token.

Before a model can process your text, it must convert it into tokens. A token is a unit of text — sometimes a word, sometimes part of a word, sometimes a single character or punctuation mark.

How tokenization works:

Most modern models use Byte Pair Encoding (BPE) or SentencePiece. The algorithm:

  1. Start with individual characters as the base vocabulary
  2. Find the most frequently co-occurring pair of tokens in the training data
  3. Merge that pair into a single new token
  4. Repeat until the vocabulary reaches a target size (typically 32K-100K tokens)
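The merge loop above can be sketched in a few lines of Python. This is a toy illustration on a four-word corpus, not a production tokenizer (real implementations handle byte-level vocabularies, pre-tokenization, and frequency ties carefully):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Each word starts as a tuple of characters. Returns the merges learned
    (in order) and the corpus rewritten with the merged symbols.
    """
    corpus = [tuple(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        rewritten = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            rewritten.append(tuple(out))
        corpus = rewritten
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "lowest", "low"], num_merges=2)
```

After two merges, "l"+"o" and then "lo"+"w" are fused, so "low" becomes a single token while rarer suffixes like "er" and "est" remain split — the same dynamic that makes common words one token and rare words several.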

Why this matters for engineers:

  • “Strawberry” = 3 tokens. The tokenizer splits it into str + aw + berry. The model never sees the individual letters s-t-r-a-w-b-e-r-r-y. This is why counting letters is hard for LLMs.
  • Common words = 1 token. “the”, “and”, “function” are single tokens. Your prompts are cheaper and faster when they use common vocabulary.
  • Code is expensive. Variable names, syntax, and whitespace in code often tokenize into many small tokens. A 100-line Python file might be 500+ tokens.
  • Non-English text costs more. Most tokenizers are trained primarily on English text. Japanese, Arabic, and Hindi text can use 2-4x more tokens per word than English.

Token counts determine cost and speed. Every API call is billed by input tokens plus output tokens. A GPT-4o call with 10,000 input tokens and 2,000 output tokens costs several times more than one with 1,000 input tokens and 500 output tokens. Understanding tokenization is understanding your cost model.

Once text is tokenized, each token is converted into a dense vector — a list of numbers that represents the token’s meaning in a high-dimensional space. This vector is called an embedding.

GPT-4’s embeddings are estimated at 12,288 dimensions. Each token becomes a list of 12,288 numbers. The key property: tokens with similar meanings end up close together in this space. “king” and “queen” are close. “king” and “bicycle” are far apart.
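The “close together” property is usually measured with cosine similarity. A minimal sketch using made-up 3-dimensional vectors (real embeddings have thousands of dimensions, and the numbers here are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (a real model learns these during training).
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.1]
bicycle = [0.1, 0.2, 0.9]

assert cosine_similarity(king, queen) > cosine_similarity(king, bicycle)
```

This same measure is what RAG systems use to rank document chunks against a query embedding.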

Positional encodings are added to tell the model where each token sits in the sequence. Without this, the model would see “dog bites man” and “man bites dog” as identical — the same tokens, just rearranged.

Embeddings are the bridge between human-readable text and the mathematical operations the model performs. You will encounter embeddings again when you study RAG — they are how retrieval systems find relevant documents.

The attention mechanism is what makes transformers work. Here is the intuition without the math.

When processing the word “it” in the sentence “The cat sat on the mat because it was tired”, the model needs to figure out that “it” refers to “the cat”, not “the mat”. Attention solves this.

How attention works (conceptual):

For every token in the sequence, attention asks: “How relevant is every other token to understanding this token?” It computes a relevance score between every pair of tokens, then uses those scores to create a weighted combination of information from all tokens.

Think of it as a lookup. Each token creates three things:

  • A query: “What am I looking for?”
  • A key: “What do I contain?”
  • A value: “What information should I contribute?”

The query of one token is compared against the keys of all other tokens. High-scoring matches contribute more of their value. Low-scoring matches contribute less.
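That query/key/value lookup can be written out in plain Python. This is a toy single-head, scaled dot-product attention over hand-picked 2-dimensional vectors; real models learn the Q, K, and V projections and use thousands of dimensions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Relevance of every token to this query, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted combination of all value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three tokens with toy 2-dim query/key/value vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = attention(Q, K, V)
```

Each output row is a blend of all three value vectors, weighted by how well that token’s query matched the others’ keys — exactly the “weighted combination” described above.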

Multi-head attention runs this process multiple times in parallel (GPT-3 used 96 attention heads; GPT-4 is estimated to use a similar or larger count). Each head learns to attend to different types of relationships — syntactic, semantic, positional. One head might learn subject-verb agreement. Another might learn coreference resolution. Another might learn that a closing parenthesis should match an opening one.

Why engineers should care:

  • Attention is why context window size matters. The model computes attention between every pair of tokens. With N tokens, that is N-squared comparisons. Doubling the context window quadruples the compute for attention.
  • Attention is why prompt structure matters for prompt engineering. Placing relevant information near the query (rather than buried in a long context) helps the model attend to it more effectively.
  • Attention is why models can “forget” information in very long contexts. With 100K+ tokens, attention scores spread thin. Information in the middle of a long document gets less attention than information at the beginning or end.

A transformer processes text through a sequence of layers. Each layer transforms the representation of every token, gradually building up from surface-level patterns (syntax, word identity) to deep patterns (reasoning, world knowledge).

Raw text enters the model as a string. The tokenizer converts it into a sequence of integer IDs (one per token). Each ID maps to an embedding vector. Positional encodings are added. The result: a matrix where each row is a token’s initial representation.

Layer 2: The Transformer Block (Repeated N Times)


The core of the model is a transformer block, repeated many times. GPT-4 is estimated to have 120 layers. Each block contains:

  1. Multi-head self-attention: Every token attends to every other token. Outputs are a refined representation of each token that incorporates information from the full context.
  2. Feed-forward network (FFN): A per-token transformation. Two linear layers with a nonlinearity in between. This is where researchers believe factual knowledge is stored — the FFN layers act as a compressed knowledge base.
  3. Layer normalization and residual connections: Technical details that stabilize training and allow information to flow through deep networks.

After 120+ rounds of attention and feed-forward processing, each token’s representation encodes rich contextual information about its meaning, its relationships, and its role in the sequence.

The final token’s representation is projected onto the vocabulary — a vector of ~100K numbers, one per possible token. A softmax function converts these into probabilities. The token with the highest probability is the model’s prediction for the next token.

The model generates text one token at a time. It takes the full input sequence, predicts the next token, appends that token to the sequence, and repeats. This is called autoregressive generation.

A 500-token response requires 500 sequential forward passes through the model. (In practice a key-value cache avoids recomputing attention for earlier tokens, but each new token still requires its own decode step.) This is why output tokens are more expensive than input tokens in most API pricing — input tokens can be processed in parallel in a single pass, while each output token requires another round of inference.
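The generation loop itself is simple. In this sketch, `predict_next` is a stand-in for the model’s forward pass (a real LLM would return a probability distribution over the vocabulary at each step):

```python
def predict_next(tokens):
    """Stand-in for a model forward pass: returns the next token.

    A real LLM would compute a distribution over ~100K tokens; here we
    fake it with a canned continuation purely for illustration.
    """
    canned = ["The", "capital", "of", "France", "is", "Paris", "."]
    return canned[len(tokens)] if len(tokens) < len(canned) else "<eos>"

def generate(prompt_tokens, max_new_tokens=10, eos="<eos>"):
    """Autoregressive generation: predict, append, repeat until done."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)  # one full inference per output token
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens

out = generate(["The", "capital", "of"])
```

Note that every iteration re-reads the full sequence so far — which is why long outputs are slow and why each output token is billed as a separate inference.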


The diagram below traces text through each stage of the transformer pipeline, from raw input to next-token probability distribution.

How a Transformer Processes Text

Each layer transforms the input representation. Understanding these layers gives you intuition for prompt engineering, fine-tuning, and debugging.

  • Input Text: Raw user prompt or document — the starting point
  • Tokenizer: Splits text into subword tokens (BPE, SentencePiece). GPT-4: ~100K vocabulary
  • Token Embeddings: Each token becomes a dense vector (768-12,288 dims) plus positional encoding
  • Attention Layers: Self-attention computes relationships between all tokens. The core innovation.
  • Feed-Forward Network: Per-token transformation. Where factual knowledge is stored.
  • Output Probabilities: Softmax over vocabulary → next token prediction. Temperature controls distribution shape.

These four examples translate LLM concepts directly into engineering decisions you will encounter in production: cost estimation, architecture selection, temperature tuning, and hallucination mitigation.

Example 1: Why Tokenization Affects Your Cost


You are building a customer support chatbot. Your system prompt is 800 tokens. Each user message averages 50 tokens. Each response averages 300 tokens. With GPT-4o pricing at $2.50/M input tokens and $10/M output tokens:

  • Per conversation turn: (800 + 50) x $2.50/M + 300 x $10/M = $0.002 + $0.003 = $0.005
  • 10-turn conversation: System prompt is resent every turn. Input tokens grow: 800 + (50 x 10) = 1,300 input tokens on the last turn. Total cost: ~$0.06

Optimization: use a shorter system prompt. Reducing it from 800 to 400 tokens saves 40% on input costs across all turns. Use a tokenizer tool (like OpenAI’s tiktoken or Anthropic’s token counter) to measure before and after.
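The arithmetic above generalizes to a small helper. The prices are the GPT-4o figures quoted in this example ($2.50/M input, $10/M output), which will drift over time, and the conversation model is simplified (prior assistant responses are ignored in the resent history):

```python
def turn_cost(input_tokens, output_tokens,
              in_price_per_m=2.50, out_price_per_m=10.0):
    """Dollar cost of one API call at per-million-token prices."""
    return (input_tokens * in_price_per_m / 1e6
            + output_tokens * out_price_per_m / 1e6)

def conversation_cost(system_tokens, user_tokens_per_turn,
                      output_tokens_per_turn, turns):
    """Total cost when the system prompt and all prior user messages
    are resent on every turn (assistant replies ignored for simplicity)."""
    total = 0.0
    for t in range(1, turns + 1):
        input_tokens = system_tokens + user_tokens_per_turn * t
        total += turn_cost(input_tokens, output_tokens_per_turn)
    return total

single = turn_cost(850, 300)                       # one turn: ~$0.005
ten_turns = conversation_cost(800, 50, 300, 10)    # ten turns: ~$0.06
```

Running the same calculation with a 400-token system prompt makes the savings from prompt trimming concrete before you touch production traffic.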

Example 2: Context Window Architecture Decisions


You need to build a Q&A system over 500 internal documents, each averaging 2,000 tokens. Total corpus: 1M tokens.

Option A — Stuff the context: With Gemini 2.0 Pro’s 2M token window, you could literally pass all documents in a single prompt. Cost: ~$1.25 per query at Gemini’s input pricing. Latency: 30-60 seconds. Accuracy on retrieval: the model processes everything, but attention degrades for information in the middle.

Option B — RAG: Embed all documents. Retrieve the top 5 most relevant chunks (~2,000 tokens). Pass only those to the model. Cost: ~$0.01 per query. Latency: 2-5 seconds. Accuracy: depends on retrieval quality, but focused context means stronger attention.

For most production systems, Option B wins. But for ad-hoc analysis of a small corpus, Option A is simpler.

Example 3: Temperature Controls Output Variability


Temperature scales the probability distribution before sampling. Here is what different temperatures do in practice:

Temperature   Behavior                                                   Use Case
0.0           Greedy — always picks the highest-probability token        Classification, extraction, structured output
0.3-0.5       Low variability — sticks close to the most likely tokens   Customer support, factual Q&A
0.7-0.9       Moderate variability — creative but coherent               Content generation, brainstorming
1.0+          High variability — surprising, sometimes incoherent        Creative writing, exploration

Rule of thumb: If the task has one correct answer (extraction, classification), use temperature 0. If the task benefits from variety (writing, ideation), use 0.7. Almost no production system uses temperature above 1.0.
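Temperature scaling is one line of math: divide each logit by the temperature before applying softmax. A sketch with invented logits for three candidate tokens:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index after temperature-scaling the logits.

    Temperature 0 is greedy argmax; 1.0 leaves the distribution
    unchanged; higher values flatten it toward uniform.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the scaled distribution.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
greedy = sample_with_temperature(logits, 0)   # always index 0
```

At temperature 0 the same index is returned every time; at 0.7 the second and third tokens get a real (if small) chance of being chosen.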

Example 4: Why Models Hallucinate

A model trained on text from 2023 is asked: “Who won the 2025 Super Bowl?” The model has no training data about this event. But it has learned the pattern: “Who won the [year] Super Bowl?” is typically followed by a team name. So it predicts a plausible-sounding team name with high confidence.

This is hallucination. The model is doing exactly what it was trained to do — predict the most likely next tokens. It has no mechanism to say “I don’t know” unless that pattern was reinforced during training (which RLHF partially addresses, but not perfectly).

Practical implications:

  • Never trust model output for factual claims without verification
  • Use RAG to ground responses in retrieved documents
  • Use evaluation pipelines to measure hallucination rates
  • Set system prompts that instruct the model to say “I don’t know” when uncertain

Every LLM architecture decision reduces to trade-offs across size, cost, latency, quality, and hosting — the framework below makes those trade-offs explicit.

Every model selection involves trade-offs across these dimensions:

Dimension            Small Model (7-13B)   Medium Model (30-70B)   Frontier Model (GPT-4o, Claude 3.5)
Latency              <500ms                1-3s                    2-10s
Cost per 1M tokens   $0.10-0.30            $0.50-2.00              $2.50-15.00
Reasoning ability    Basic                 Good for domain tasks   Strong general reasoning
Hosting              Single GPU            2-4 GPUs                API only (mostly)
Fine-tuning          Easy, cheap           Moderate                Limited or expensive

Use a small model (Llama 3.1 8B, Mistral 7B) when:

  • The task is narrow and well-defined (classification, extraction, routing)
  • Latency budget is <1 second
  • You have training data to fine-tune for your domain
  • Volume is high enough that API costs would be prohibitive

Use a frontier model (GPT-4o, Claude 3.5 Sonnet) when:

  • The task requires general reasoning across domains
  • You need strong instruction-following without fine-tuning
  • Quality matters more than cost
  • You are prototyping and need fast iteration

Use a reasoning model (o3, Claude extended thinking) when:

  • The task involves multi-step logic, math, or code generation
  • Accuracy on hard problems matters more than latency
  • You can afford 10-30 second response times

Scenario                     Best Approach       Why
Corpus <50K tokens, static   Stuff the context   Simpler, no retrieval infrastructure
Corpus 50K-500K tokens       RAG                 Better cost, attention quality
Corpus >500K tokens          RAG + reranking     Must retrieve; full context too expensive
Corpus changes frequently    RAG                 Embedding updates cheaper than re-prompting
Need precise citation        RAG                 Can track which chunks contributed
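The decision table above can be captured as a small helper. The thresholds mirror the table and are rough heuristics, not universal rules:

```python
def choose_architecture(corpus_tokens, changes_frequently=False,
                        needs_citations=False):
    """Rough context-vs-RAG decision, mirroring the scenario table."""
    if changes_frequently or needs_citations:
        return "rag"                    # freshness or citation tracking
    if corpus_tokens < 50_000:
        return "stuff-context"          # simpler, no retrieval infra
    if corpus_tokens <= 500_000:
        return "rag"                    # better cost and attention quality
    return "rag-with-reranking"         # full context too expensive

assert choose_architecture(30_000) == "stuff-context"
assert choose_architecture(30_000, needs_citations=True) == "rag"
assert choose_architecture(1_000_000) == "rag-with-reranking"
```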

These questions are drawn from real GenAI engineering interviews; the strong answers below reflect what distinguishes candidates who understand the architecture from those who have only used the APIs.

Q: What is a token? How does tokenization work?

Strong answer: “A token is a subword unit. The tokenizer uses BPE to build a vocabulary of common subword sequences. Common English words are single tokens. Rare words and code get split into multiple tokens. The vocabulary size is typically 32K-100K. Tokenization affects cost because API pricing is per-token, and it affects model behavior because the model reasons at the token level, not the character level.”

Q: What is the difference between temperature 0 and temperature 1?

Strong answer: “Temperature scales the logit distribution before sampling. At temperature 0, the model always picks the highest-probability token — it is deterministic. At temperature 1, the original distribution is used — lower-probability tokens have a real chance of being selected. Higher temperature means more randomness. I use temperature 0 for extraction and classification, 0.7 for content generation.”

Q: Why do LLMs hallucinate?

Strong answer: “LLMs are trained to predict the next most likely token, not to verify factual accuracy. When asked about something outside their training data or something rare in their training data, they generate plausible-sounding tokens based on pattern matching. The model has no built-in mechanism to distinguish ‘I know this’ from ‘this sounds right.’ RLHF reduces but does not eliminate hallucination.”

Q: How would you design a system that uses both a small model and a large model?

Strong answer: “I would use a model router pattern. A small, fast model (8B parameters, <200ms latency) handles classification: is this query simple or complex? Simple queries — FAQ lookups, classification, extraction — go to the small model. Complex queries — multi-step reasoning, novel analysis, ambiguous intent — route to a frontier model. This reduces cost by 60-80% while maintaining quality on hard queries. The router itself can be fine-tuned on your traffic distribution.”
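A minimal version of that router pattern, with a keyword heuristic standing in for the small classifier model (the function names and signals here are illustrative, not a real API):

```python
def classify_complexity(query):
    """Placeholder for the small router model: crude keyword heuristic.

    In production this would be a call to a fast fine-tuned classifier,
    not substring matching.
    """
    hard_signals = ("why", "compare", "design", "step by step", "prove")
    return "complex" if any(s in query.lower() for s in hard_signals) else "simple"

def route(query):
    """Send simple queries to a cheap model, complex ones to a frontier model."""
    if classify_complexity(query) == "complex":
        return ("frontier-model", query)   # e.g. GPT-4o / Claude 3.5 Sonnet
    return ("small-model", query)          # e.g. an 8B open-weight model

model, _ = route("What are your opening hours?")
```

The interesting engineering work is in the classifier: it should be trained on your actual traffic distribution, and its mistakes should fail safe (route to the frontier model when unsure).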

Q: Explain the attention mechanism to a product manager.

Strong answer: “When the model reads a sentence, each word looks at every other word to understand context. ‘It’ in ‘The server crashed because it ran out of memory’ needs to know that ‘it’ refers to ‘the server.’ Attention computes these connections automatically. Longer inputs mean more connections to compute, which is why longer prompts are slower and more expensive. And very long contexts can dilute attention — important information in the middle of a 100-page document might get less focus than information at the start or end.”

Q: What are the failure modes of large context windows?

Strong answer: “Three main failure modes. First, the ‘lost in the middle’ problem — models attend more strongly to the beginning and end of long contexts, so information in the middle gets less attention. Second, cost scales linearly with context length, so a 200K token context at GPT-4o pricing is $0.50 per query — that adds up fast. Third, latency increases with context length. For production systems serving thousands of queries, stuffing 200K tokens per request is usually not viable. RAG with targeted retrieval is more practical.”


Running LLMs in production requires active management of token budgets, latency, and the infrastructure decision between managed APIs and self-hosted models.

Every production LLM application needs a token budget strategy. Here is a practical framework:

  1. Measure your baseline. Use a tokenizer library to count tokens in your system prompt, average user input, and average model output. Track this as a metric.
  2. Set per-request limits. Cap max_tokens in your API calls. A customer support bot should not generate 4,000-token responses. Set it to 500 and tune from there.
  3. Monitor token usage. Track input and output tokens per request. Alert when averages shift — a prompt change that adds 200 tokens to every request costs real money at scale.
  4. Compress context. Summarize conversation history instead of passing the full transcript. A 20-turn conversation with full history can exceed 10K tokens. A summary fits in 500.
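Steps 2 and 4 can be combined into a small context builder that estimates token usage and compresses old history once a budget is exceeded. The 4-characters-per-token estimate and the `summarize` hook are placeholders; use a real tokenizer and a cheap model call in production:

```python
def count_tokens(text):
    """Crude token estimate: ~4 characters per token for English text.

    A real system should use the provider's tokenizer; this is a sketch.
    """
    return max(1, len(text) // 4)

def build_context(system_prompt, history, budget_tokens=2000, summarize=None):
    """Return messages that fit the budget, summarizing old history if needed."""
    used = count_tokens(system_prompt) + sum(count_tokens(m) for m in history)
    if used <= budget_tokens:
        return [system_prompt] + history
    # Over budget: replace all but the last two turns with a summary.
    old, recent = history[:-2], history[-2:]
    summary = summarize(old) if summarize else "Summary of earlier conversation."
    return [system_prompt, summary] + recent

ctx = build_context("You are a support bot.", ["hi", "hello, how can I help?"])
```

A 20-turn transcript that would cost 10K tokens per request collapses to a few hundred tokens of summary plus the most recent turns.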

Technique                                 Impact                                  Complexity
Streaming responses                       Perceived latency drops 50-80%          Low
Shorter prompts                           Linear reduction in input processing    Low
Model downgrade (GPT-4o mini vs GPT-4o)   2-5x faster                             Low
Caching common queries                    Near-zero latency for cache hits        Medium
Prompt caching (Anthropic, OpenAI)        50-90% reduction on repeated prefixes   Low
Self-hosted model with batching           Full control over throughput            High
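Of these, caching common queries is often the quickest win. A sketch using `functools.lru_cache` with a stand-in `call_llm`; a production cache would normalize prompts and live in shared storage such as Redis:

```python
import functools

CALLS = {"count": 0}

def call_llm(prompt):
    """Stand-in for a real API call; counts invocations for illustration."""
    CALLS["count"] += 1
    return f"answer to: {prompt}"

@functools.lru_cache(maxsize=1024)
def cached_answer(prompt):
    """Identical prompts are served in-process; only misses hit the model."""
    return call_llm(prompt)

cached_answer("What are your opening hours?")
cached_answer("What are your opening hours?")  # cache hit, no second API call
```

Exact-match caching only helps when users repeat the same queries; for paraphrases, embedding-based semantic caching is the usual next step.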

Use an API (OpenAI, Anthropic, Google) when:

  • You want zero infrastructure overhead
  • Your volume is <100K requests/day
  • You need frontier-model quality
  • Data privacy requirements allow third-party processing

Self-host (vLLM, TGI, Ollama) when:

  • Data cannot leave your infrastructure (healthcare, finance, government)
  • Volume exceeds 100K requests/day and cost matters
  • You need to fine-tune a model on proprietary data
  • Latency requirements demand co-located inference

Provider    Key Models                         Strengths
OpenAI      GPT-4o, o3, GPT-4o mini            Broad capability, strong tool use, reasoning (o3)
Anthropic   Claude 3.5 Sonnet, Claude 3 Opus   Long context (200K), strong instruction following, safety
Google      Gemini 2.0 Pro, Gemini 2.0 Flash   Massive context (2M), multimodal, competitive pricing
Meta        Llama 3.1 (8B, 70B, 405B)          Open-weight, fine-tunable, self-hostable
Mistral     Mistral Large, Mixtral             European provider, strong multilingual, MoE efficiency
DeepSeek    DeepSeek V3                        Open-weight, strong reasoning, very competitive benchmarks

No single model dominates all tasks. Production systems increasingly use multiple models — a fast model for routing, a mid-tier model for common tasks, and a frontier model for hard queries.


With LLM fundamentals in place, the logical next steps are prompt engineering, RAG, agents, and evaluation — each builds directly on what you have learned here.

  1. LLMs are next-token predictors. Every behavior — good and bad — follows from this objective.
  2. Tokenization is the first bottleneck. It determines cost, speed, and the model’s ability to reason about text.
  3. Attention is the core innovation. It lets every token in the sequence consider every other token.
  4. Context windows are not free. Bigger is not always better — cost, latency, and attention quality all degrade with length.
  5. Temperature controls the randomness-reliability trade-off. Low for accuracy, high for creativity.
  6. Hallucination is structural, not a bug. The model predicts likely tokens, not verified facts. Design your system accordingly.
  7. Model selection is a multi-dimensional trade-off. Size, cost, latency, quality, and hosting constraints all matter.

With this foundation, every other topic on the site will make more sense:

  • Prompt Engineering — Now that you understand tokens and attention, learn how to structure inputs for better outputs.
  • RAG Architecture — Understand how retrieval augmented generation solves the knowledge limitation of LLMs.
  • AI Agents — Learn how LLMs go from answering questions to executing multi-step tasks.
  • LLM Evaluation — Build the testing infrastructure that catches hallucinations and quality regressions.
  • Fine-Tuning vs RAG — Learn when and how to adapt a model to your specific domain.
  • Python for GenAI — Set up your development environment for building LLM applications.

Frequently Asked Questions

How do large language models work?

A large language model is a next-token prediction machine. It takes a sequence of tokens as input and predicts the most probable next token, repeating this process to generate complete responses. The transformer architecture powers this through self-attention, which lets each token consider its relationship to every other token.

What is tokenization in LLMs?

Tokenization splits text into subword pieces (tokens) that the model processes. Words are split into common subword units — 'programming' might become 'program' and 'ming'. This is why models struggle with tasks like counting letters in 'strawberry' — they never see individual characters, only subword tokens. Learn more in the tokenization guide.

What is attention in transformers?

Attention is the mechanism that lets each token consider its relationship to every other token in the sequence. When processing the word 'bank' in a sentence about finance versus rivers, attention weights shift to disambiguate the meaning. Self-attention scales quadratically with sequence length, which is why context windows have cost and size limits.

Why do LLMs hallucinate facts?

LLMs hallucinate because they predict plausible-sounding next tokens, not verified facts. The model has no mechanism to distinguish between generating a true statement and a false one — both are just probable token sequences. The training objective is to produce text that statistically resembles the training data, not to be factually accurate.

What are embeddings in LLMs?

Embeddings are dense vectors — lists of numbers — that represent a token's meaning in a high-dimensional space. GPT-4's embeddings are estimated at 12,288 dimensions. Tokens with similar meanings end up close together in this space. Learn more in the embeddings guide.

What is the difference between temperature 0 and temperature 1 in LLMs?

Temperature scales the probability distribution before sampling the next token. At temperature 0, the model always picks the highest-probability token (deterministic, greedy). At temperature 1, the original distribution is used, giving lower-probability tokens a chance. Use temperature 0 for extraction and classification; use 0.7 for content generation.

How do context windows affect LLM cost and performance?

Context windows determine how many tokens a model can process in a single request. Larger windows allow processing more text, but cost scales linearly and attention quality can degrade. The 'lost in the middle' problem means models attend more strongly to the beginning and end of long inputs. See the context windows guide for details.

When should I use RAG instead of a large context window?

For corpora under 50K tokens, stuffing the context window is simpler. For corpora above 50K tokens, RAG with targeted retrieval is cheaper, faster, and often more accurate. RAG also wins when your corpus changes frequently or when you need precise citation tracking.

How do I choose between a small model and a frontier model?

Use a small model (7-13B parameters) when the task is narrow, latency must be under 1 second, you have fine-tuning data, or volume makes API costs prohibitive. Use a frontier model (GPT-4o, Claude 3.5 Sonnet) when the task requires general reasoning or strong instruction-following without fine-tuning.

What is the transformer pipeline from input text to output?

The transformer pipeline has three stages. First, raw text is split into tokens and converted into embedding vectors with positional encoding. Second, representations pass through repeated transformer blocks containing multi-head self-attention and feed-forward networks. Finally, the last token's representation is projected onto the vocabulary to produce a next-token probability distribution.