LLM Fundamentals — Tokens, Attention & Transformers (2026)
Every GenAI engineering skill you will learn — prompt engineering, RAG, agents, evaluation — sits on top of one foundation: understanding how large language models actually work. Not the math. The mental models. This page gives you the intuition that makes everything else click.
Who this is for:
- Software engineers switching into GenAI: You can code. You need the conceptual foundation to reason about LLMs, not a PhD-level paper walkthrough.
- Junior GenAI engineers: You’ve used ChatGPT and maybe built a basic RAG app. You want to understand why your system behaves the way it does.
- Senior engineers in interviews: You need to explain transformers, attention, and tokenization clearly and concisely under pressure.
1. Why LLM Fundamentals Matter
A large language model is a next-token prediction machine. That is the single most important sentence in this guide. Everything a model does — answering questions, writing code, translating languages, summarizing documents — is the result of predicting the next token in a sequence, over and over.
This sounds reductive. It is not. The transformer architecture that powers modern LLMs achieves extraordinary capability from this simple objective. But understanding the objective helps you reason about the system’s behavior:
- Why does a model sometimes “hallucinate” facts? Because it predicts plausible-sounding next tokens, not verified facts.
- Why does a longer prompt cost more? Because the model processes every token in the input.
- Why does rephrasing a prompt change the output? Because different tokens activate different patterns learned during training.
- Why do models struggle with counting letters in “strawberry”? Because tokenization splits words into subword pieces, and the model never sees individual characters.
You do not need linear algebra to understand these dynamics. You need the right mental models.
2. What Changed in 2025-2026
The 2025-2026 period brought three shifts that directly change how you architect LLM systems: larger context windows, reasoning models, and competitive open-weight alternatives.
Context Windows Exploded
In 2023, 4K tokens was standard. GPT-4 offered 32K as a premium tier. By early 2026, the landscape looks radically different:
| Model | Context Window | Approx. Pages of Text |
|---|---|---|
| GPT-4o | 128K tokens | ~300 pages |
| Claude 3.5 Sonnet | 200K tokens | ~500 pages |
| Gemini 2.0 Pro | 2M tokens | ~5,000 pages |
| Llama 3.1 405B | 128K tokens | ~300 pages |
This changes architecture decisions. Many use cases that required RAG in 2023 can now be solved by stuffing the entire document into the context window. But context window size is not free — cost scales linearly with input length, and models attend less reliably to information buried in very long contexts.
Reasoning Models Emerged
OpenAI’s o1 and o3, Anthropic’s Claude extended thinking, and Google’s Gemini thinking mode introduced a new paradigm: models that “think” before responding. These models use chain-of-thought reasoning internally, spending more compute (and more tokens) to improve accuracy on hard problems.
For engineers, this means: the same model can operate in two modes. Fast mode for simple tasks. Slow, expensive mode for complex reasoning. Your architecture needs to account for both.
Open-Weight Models Caught Up
Llama 3.1, Mistral Large, and DeepSeek V3 narrowed the gap with proprietary models. For many production use cases — classification, extraction, summarization — a 70B open-weight model running on your infrastructure matches GPT-4-class performance at a fraction of the cost. Fine-tuning an open-weight model on your domain data often outperforms a general-purpose proprietary model.
3. LLM Fundamentals — Core Concepts
Four concepts — what an LLM is, how tokenization works, what embeddings represent, and how attention operates — form the complete mental model for reasoning about LLM behavior.
What Is an LLM, Really?
Forget the Wikipedia definition. Here is the engineer’s mental model:
An LLM is a compressed representation of patterns in text. During training, the model reads trillions of tokens — books, code, websites, conversations — and learns statistical patterns: which tokens tend to follow which other tokens, in what contexts, with what relationships. The result is a function that takes a sequence of tokens as input and outputs a probability distribution over the next token.
That function has billions of parameters (weights). GPT-4 is estimated at ~1.8 trillion parameters across its mixture-of-experts architecture. Anthropic and Google do not disclose parameter counts for Claude 3.5 Sonnet or Gemini, but they are likely in the hundreds-of-billions range.
The key insight: the model does not “know” things the way a database knows things. It has learned patterns. When you ask “What is the capital of France?”, the model does not look up France in a table. It has seen the pattern “capital of France” followed by “Paris” so many times during training that “Paris” has an extremely high probability as the next token.
Tokenization: The First Step
Before a model can process your text, it must convert it into tokens. A token is a unit of text — sometimes a word, sometimes part of a word, sometimes a single character or punctuation mark.
How tokenization works:
Most modern models use Byte Pair Encoding (BPE) or SentencePiece. The algorithm:
- Start with individual characters as the base vocabulary
- Find the most frequently co-occurring pair of tokens in the training data
- Merge that pair into a single new token
- Repeat until the vocabulary reaches a target size (typically 32K-100K tokens)
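The four steps above can be sketched as a toy merge loop in Python. This is purely illustrative: real BPE tokenizers operate on bytes, weight pairs by word frequency, and store the learned merges for fast encoding, none of which this sketch does.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    corpus = [list(w) for w in words]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus
        pairs = Counter()
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol
        new_corpus = []
        for symbols in corpus:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]: "low" is now a single token
```

After just two merges, the frequent prefix “low” has become one token while the rarer suffixes “er” and “est” remain split, which is exactly the behavior that makes rare words and code more expensive.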
Why this matters for engineers:
- “Strawberry” = 3 tokens. The tokenizer splits it into str + aw + berry. The model never sees the individual letters s-t-r-a-w-b-e-r-r-y. This is why counting letters is hard for LLMs.
- Common words = 1 token. “the”, “and”, “function” are single tokens. Your prompts are cheaper and faster when they use common vocabulary.
- Code is expensive. Variable names, syntax, and whitespace in code often tokenize into many small tokens. A 100-line Python file might be 500+ tokens.
- Non-English text costs more. Most tokenizers are trained primarily on English text. Japanese, Arabic, and Hindi text can use 2-4x more tokens per word than English.
Token counts determine cost and speed. Every API call is billed by input tokens + output tokens. A GPT-4o call with 1,000 input tokens and 500 output tokens costs roughly one-sixth as much as one with 10,000 input tokens and 2,000 output tokens. Understanding tokenization is understanding your cost model.
Embeddings: Tokens Become Vectors
Once text is tokenized, each token is converted into a dense vector — a list of numbers that represents the token’s meaning in a high-dimensional space. This vector is called an embedding.
GPT-3 used 12,288-dimensional embeddings, and frontier models are in a similar range; each token becomes a list of thousands of numbers. The key property: tokens with similar meanings end up close together in this space. “king” and “queen” are close. “king” and “bicycle” are far apart.
Positional encodings are added to tell the model where each token sits in the sequence. Without this, the model would see “dog bites man” and “man bites dog” as identical — the same tokens, just rearranged.
Embeddings are the bridge between human-readable text and the mathematical operations the model performs. You will encounter embeddings again when you study RAG — they are how retrieval systems find relevant documents.
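The “close together” idea is usually measured with cosine similarity, the standard closeness score for embeddings (it is also how RAG retrieval ranks documents). The 3-dimensional vectors below are invented purely to show the geometry; real embeddings have thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings", chosen only to illustrate the property
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.9, 0.15]
bicycle = [0.1, 0.05, 0.95]

print(cosine_similarity(king, queen))    # high: related meanings
print(cosine_similarity(king, bicycle))  # low: unrelated meanings
```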
Attention: The Core Innovation
The attention mechanism is what makes transformers work. Here is the intuition without the math.
When processing the word “it” in the sentence “The cat sat on the mat because it was tired”, the model needs to figure out that “it” refers to “the cat”, not “the mat”. Attention solves this.
How attention works (conceptual):
For every token in the sequence, attention asks: “How relevant is every other token to understanding this token?” It computes a relevance score between every pair of tokens, then uses those scores to create a weighted combination of information from all tokens.
Think of it as a lookup. Each token creates three things:
- A query: “What am I looking for?”
- A key: “What do I contain?”
- A value: “What information should I contribute?”
The query of one token is compared against the keys of all other tokens. High-scoring matches contribute more of their value. Low-scoring matches contribute less.
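This query/key/value lookup is scaled dot-product attention. A minimal sketch in plain Python, assuming the Q, K, and V vectors are already given (in a real transformer they are learned linear projections of the token embeddings, computed as batched matrix multiplications):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over toy vectors (lists of floats).
    For each query: score it against every key, softmax the scores,
    and return the weighted sum of the values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Toy example with invented numbers: a query that strongly matches the
# first key pulls in almost all of the first value
out = attention(queries=[[10.0, 0.0]],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[1.0, 0.0], [0.0, 1.0]])
```

Note the double loop over tokens: every query scores against every key, which is where the N-squared cost of attention comes from.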
Multi-head attention runs this process multiple times in parallel (GPT-3, whose architecture is public, used 96 attention heads). Each head learns to attend to different types of relationships — syntactic, semantic, positional. One head might learn subject-verb agreement. Another might learn coreference resolution. Another might learn that a closing parenthesis should match an opening one.
Why engineers should care:
- Attention is why context window size matters. The model computes attention between every pair of tokens. With N tokens, that is N-squared comparisons. Doubling the context window quadruples the compute for attention.
- Attention is why prompt structure matters for prompt engineering. Placing relevant information near the query (rather than buried in a long context) helps the model attend to it more effectively.
- Attention is why models can “forget” information in very long contexts. With 100K+ tokens, attention scores spread thin. Information in the middle of a long document gets less attention than information at the beginning or end.
4. Deep Dive: The Transformer Pipeline
A transformer processes text through a sequence of layers. Each layer transforms the representation of every token, gradually building up from surface-level patterns (syntax, word identity) to deep patterns (reasoning, world knowledge).
Layer 1: Tokenization and Input Embedding
Raw text enters the model as a string. The tokenizer converts it into a sequence of integer IDs (one per token). Each ID maps to an embedding vector. Positional encodings are added. The result: a matrix where each row is a token’s initial representation.
Layer 2: The Transformer Block (Repeated N Times)
The core of the model is a transformer block, repeated many times. GPT-4 is estimated to have 120 layers. Each block contains:
- Multi-head self-attention: Every token attends to every other token. Outputs are a refined representation of each token that incorporates information from the full context.
- Feed-forward network (FFN): A per-token transformation. Two linear layers with a nonlinearity in between. This is where researchers believe factual knowledge is stored — the FFN layers act as a compressed knowledge base.
- Layer normalization and residual connections: Technical details that stabilize training and allow information to flow through deep networks.
After 120+ rounds of attention and feed-forward processing, each token’s representation encodes rich contextual information about its meaning, its relationships, and its role in the sequence.
Layer 3: Output Projection
The final token’s representation is projected onto the vocabulary — a vector of ~100K numbers, one per possible token. A softmax function converts these into probabilities. The token with the highest probability is the model’s prediction for the next token.
How Generation Works
The model generates text one token at a time. It takes the full input sequence, predicts the next token, appends that token to the sequence, and repeats. This is called autoregressive generation.
A 500-token response requires 500 forward passes through the entire model. This is why output tokens cost more than input tokens in most API pricing — input tokens are processed together in a single parallel pass, while each output token requires its own full model inference.
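The autoregressive loop can be sketched with a stub standing in for the real forward pass. `toy_model` below is purely hypothetical; a real model returns a probability distribution over its full vocabulary.

```python
def generate(model_step, prompt_tokens, max_new_tokens, eos_token=None):
    """Autoregressive greedy decoding: one full forward pass per new token.
    `model_step` is any function mapping a token sequence to a dict of
    {next_token: probability} -- here a stand-in for a real model."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model_step(tokens)              # full forward pass
        next_token = max(probs, key=probs.get)  # greedy: pick most likely
        tokens.append(next_token)
        if next_token == eos_token:             # stop token ends generation
            break
    return tokens

# Hypothetical toy "model" that just continues a counting pattern
def toy_model(tokens):
    return {tokens[-1] + 1: 0.9, tokens[-1]: 0.1}

print(generate(toy_model, [1, 2, 3], max_new_tokens=3))  # [1, 2, 3, 4, 5, 6]
```

Each iteration re-runs `model_step` on the whole growing sequence, which is exactly why a 500-token response means 500 forward passes.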
5. Architecture Visualization
The diagram below traces text through each stage of the transformer pipeline, from raw input to next-token probability distribution.
The Transformer Pipeline
[Diagram: How a Transformer Processes Text]
Each layer transforms the input representation. Understanding these layers gives you intuition for prompt engineering, fine-tuning, and debugging.
6. LLM Fundamentals Code Examples
These four examples translate LLM concepts directly into engineering decisions you will encounter in production: cost estimation, architecture selection, temperature tuning, and hallucination mitigation.
Example 1: Why Tokenization Affects Your Cost
You are building a customer support chatbot. Your system prompt is 800 tokens. Each user message averages 50 tokens. Each response averages 300 tokens. With GPT-4o pricing at $2.50/M input tokens and $10/M output tokens:
- Per conversation turn: (800 + 50) x $2.50/M + 300 x $10/M = $0.002 + $0.003 = $0.005
- 10-turn conversation: The system prompt is resent every turn, and input tokens grow as user messages accumulate: 800 + (50 x 10) = 1,300 input tokens on the last turn. Total cost: ~$0.06 (a simplification: prior assistant responses also accumulate in the input, raising the real figure).
Optimization: use a shorter system prompt. Reducing it from 800 to 400 tokens saves 40% on input costs across all turns. Use a tokenizer tool (like OpenAI’s tiktoken or Anthropic’s token counter) to measure before and after.
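The arithmetic above generalizes to a small estimator. This sketch matches the simplified example: the system prompt and all prior user messages are resent each turn, and assistant-response history is omitted.

```python
def conversation_cost(system_tokens, user_tokens, output_tokens, turns,
                      in_price_per_m=2.50, out_price_per_m=10.00):
    """Dollar cost of a multi-turn chat. Input on turn k is the system
    prompt plus k user messages; assistant history is omitted, as in the
    worked example above (including it would raise the figure)."""
    total_input = sum(system_tokens + user_tokens * k
                      for k in range(1, turns + 1))
    total_output = output_tokens * turns
    return (total_input * in_price_per_m
            + total_output * out_price_per_m) / 1_000_000

print(round(conversation_cost(800, 50, 300, 1), 4))   # 0.0051 -> the ~$0.005 per-turn figure
print(round(conversation_cost(800, 50, 300, 10), 4))  # 0.0569 -> the ~$0.06 conversation figure
```

Halving the system prompt to 400 tokens drops the 10-turn total to roughly $0.047, which is where the input-cost savings claim comes from.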
Example 2: Context Window Architecture Decisions
You need to build a Q&A system over 500 internal documents, each averaging 2,000 tokens. Total corpus: 1M tokens.
Option A — Stuff the context: With Gemini 2.0 Pro’s 2M token window, you could literally pass all documents in a single prompt. Cost: ~$1.25 per query at Gemini’s input pricing. Latency: 30-60 seconds. Accuracy on retrieval: the model processes everything, but attention degrades for information in the middle.
Option B — RAG: Embed all documents. Retrieve the top 5 most relevant chunks (~2,000 tokens). Pass only those to the model. Cost: ~$0.01 per query. Latency: 2-5 seconds. Accuracy: depends on retrieval quality, but focused context means stronger attention.
For most production systems, Option B wins. But for ad-hoc analysis of a small corpus, Option A is simpler.
Example 3: Temperature Controls Output Variability
Temperature scales the probability distribution before sampling. Here is what different temperatures do in practice:
| Temperature | Behavior | Use Case |
|---|---|---|
| 0.0 | Greedy — always picks the highest-probability token | Classification, extraction, structured output |
| 0.3-0.5 | Low variability — sticks close to the most likely tokens | Customer support, factual Q&A |
| 0.7-0.9 | Moderate variability — creative but coherent | Content generation, brainstorming |
| 1.0+ | High variability — surprising, sometimes incoherent | Creative writing, exploration |
Rule of thumb: If the task has one correct answer (extraction, classification), use temperature 0. If the task benefits from variety (writing, ideation), use 0.7. Almost no production system uses temperature above 1.0.
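The scaling itself is simple: divide the logits by the temperature before the softmax. A sketch in plain Python (real APIs do this server-side; the logit values below are invented):

```python
import math, random

def sample_with_temperature(logits, temperature, rng=random):
    """Divide logits by temperature, softmax, then sample one index.
    Temperature 0 is treated as greedy; higher values flatten the
    distribution so lower-probability tokens get a real chance."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng.random(), 0.0                 # inverse-CDF sampling
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

At temperature 0 this always returns the argmax; at very low temperatures it almost always does; at 1.0 it samples from the unscaled distribution.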
Example 4: Why Models Hallucinate
A model trained on text from 2023 is asked: “Who won the 2025 Super Bowl?” The model has no training data about this event. But it has learned the pattern: “Who won the [year] Super Bowl?” is typically followed by a team name. So it predicts a plausible-sounding team name with high confidence.
This is hallucination. The model is doing exactly what it was trained to do — predict the most likely next tokens. It has no mechanism to say “I don’t know” unless that pattern was reinforced during training (which RLHF partially addresses, but not perfectly).
Practical implications:
- Never trust model output for factual claims without verification
- Use RAG to ground responses in retrieved documents
- Use evaluation pipelines to measure hallucination rates
- Set system prompts that instruct the model to say “I don’t know” when uncertain
7. LLM Trade-offs and Decision Framework
Every LLM architecture decision reduces to trade-offs across size, cost, latency, quality, and hosting — the framework below makes those trade-offs explicit.
Model Selection: The 5 Dimensions
Every model selection involves trade-offs across these dimensions:
| Dimension | Small Model (7-13B) | Medium Model (30-70B) | Frontier Model (GPT-4o, Claude 3.5) |
|---|---|---|---|
| Latency | <500ms | 1-3s | 2-10s |
| Cost per 1M tokens | $0.10-0.30 | $0.50-2.00 | $2.50-15.00 |
| Reasoning ability | Basic | Good for domain tasks | Strong general reasoning |
| Hosting | Single GPU | 2-4 GPUs | API only (mostly) |
| Fine-tuning | Easy, cheap | Moderate | Limited or expensive |
When to Use Which Model
Use a small model (Llama 3.1 8B, Mistral 7B) when:
- The task is narrow and well-defined (classification, extraction, routing)
- Latency budget is <1 second
- You have training data to fine-tune for your domain
- Volume is high enough that API costs would be prohibitive
Use a frontier model (GPT-4o, Claude 3.5 Sonnet) when:
- The task requires general reasoning across domains
- You need strong instruction-following without fine-tuning
- Quality matters more than cost
- You are prototyping and need fast iteration
Use a reasoning model (o3, Claude extended thinking) when:
- The task involves multi-step logic, math, or code generation
- Accuracy on hard problems matters more than latency
- You can afford 10-30 second response times
Context Window vs. RAG Decision Matrix
| Scenario | Best Approach | Why |
|---|---|---|
| Corpus <50K tokens, static | Stuff the context | Simpler, no retrieval infrastructure |
| Corpus 50K-500K tokens | RAG | Better cost, attention quality |
| Corpus >500K tokens | RAG + reranking | Must retrieve; full context too expensive |
| Corpus changes frequently | RAG | Embedding updates cheaper than re-prompting |
| Need precise citation | RAG | Can track which chunks contributed |
8. Interview Questions and Signals
These questions are drawn from real GenAI engineering interviews; the strong answers below reflect what distinguishes candidates who understand the architecture from those who have only used the APIs.
Junior Level (0-2 Years)
Q: What is a token? How does tokenization work?
Strong answer: “A token is a subword unit. The tokenizer uses BPE to build a vocabulary of common subword sequences. Common English words are single tokens. Rare words and code get split into multiple tokens. The vocabulary size is typically 32K-100K. Tokenization affects cost because API pricing is per-token, and it affects model behavior because the model reasons at the token level, not the character level.”
Q: What is the difference between temperature 0 and temperature 1?
Strong answer: “Temperature scales the logit distribution before sampling. At temperature 0, the model always picks the highest-probability token — it is deterministic. At temperature 1, the original distribution is used — lower-probability tokens have a real chance of being selected. Higher temperature means more randomness. I use temperature 0 for extraction and classification, 0.7 for content generation.”
Q: Why do LLMs hallucinate?
Strong answer: “LLMs are trained to predict the next most likely token, not to verify factual accuracy. When asked about something outside their training data or something rare in their training data, they generate plausible-sounding tokens based on pattern matching. The model has no built-in mechanism to distinguish ‘I know this’ from ‘this sounds right.’ RLHF reduces but does not eliminate hallucination.”
Senior Level (3+ Years)
Q: How would you design a system that uses both a small model and a large model?
Strong answer: “I would use a model router pattern. A small, fast model (8B parameters, <200ms latency) handles classification: is this query simple or complex? Simple queries — FAQ lookups, classification, extraction — go to the small model. Complex queries — multi-step reasoning, novel analysis, ambiguous intent — route to a frontier model. This reduces cost by 60-80% while maintaining quality on hard queries. The router itself can be fine-tuned on your traffic distribution.”
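The router pattern in that answer reduces to a few lines of control flow. In this sketch, `classify`, `small`, and `frontier` are hypothetical stand-ins for a real classifier (e.g. a fine-tuned 8B model) and real model calls:

```python
def route(query, classify, small_model, frontier_model):
    """Model-router sketch: a cheap classifier decides which model
    serves each query, so expensive calls only happen when needed."""
    label = classify(query)  # "simple" or "complex"
    if label == "simple":
        return small_model(query)
    return frontier_model(query)

# Hypothetical stand-ins, purely to show the control flow: a real
# router would call actual models, not tag strings
classify = lambda q: "simple" if len(q.split()) < 8 else "complex"
small = lambda q: f"[small] {q}"
frontier = lambda q: f"[frontier] {q}"

print(route("What are your hours?", classify, small, frontier))
```

The interesting engineering is all inside `classify`: its accuracy on your real traffic distribution determines how much quality you give up for the cost savings.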
Q: Explain the attention mechanism to a product manager.
Strong answer: “When the model reads a sentence, each word looks at every other word to understand context. ‘It’ in ‘The server crashed because it ran out of memory’ needs to know that ‘it’ refers to ‘the server.’ Attention computes these connections automatically. Longer inputs mean more connections to compute, which is why longer prompts are slower and more expensive. And very long contexts can dilute attention — important information in the middle of a 100-page document might get less focus than information at the start or end.”
Q: What are the failure modes of large context windows?
Strong answer: “Three main failure modes. First, the ‘lost in the middle’ problem — models attend more strongly to the beginning and end of long contexts, so information in the middle gets less attention. Second, cost scales linearly with context length, so a 200K token context at GPT-4o pricing is $0.50 per query — that adds up fast. Third, latency increases with context length. For production systems serving thousands of queries, stuffing 200K tokens per request is usually not viable. RAG with targeted retrieval is more practical.”
9. LLM Fundamentals in Production
Running LLMs in production requires active management of token budgets, latency, and the infrastructure decision between managed APIs and self-hosted models.
Token Budget Management
Every production LLM application needs a token budget strategy. Here is a practical framework:
- Measure your baseline. Use a tokenizer library to count tokens in your system prompt, average user input, and average model output. Track this as a metric.
- Set per-request limits. Cap max_tokens in your API calls. A customer support bot should not generate 4,000-token responses. Set it to 500 and tune from there.
- Monitor token usage. Track input and output tokens per request. Alert when averages shift — a prompt change that adds 200 tokens to every request costs real money at scale.
- Compress context. Summarize conversation history instead of passing the full transcript. A 20-turn conversation with full history can exceed 10K tokens. A summary fits in 500.
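A common first step before full summarization is simply trimming history to a token budget, keeping the newest messages. A sketch, where `approx_tokens` is a crude stand-in for a real tokenizer library such as tiktoken:

```python
def trim_history(messages, token_budget, count_tokens):
    """Keep the most recent messages that fit within `token_budget`.
    `count_tokens` is any callable returning a token count per message.
    Production systems often go further and replace the dropped prefix
    with a model-written summary instead of discarding it."""
    kept, used = [], 0
    for msg in reversed(messages):   # walk newest-first
        cost = count_tokens(msg)
        if used + cost > token_budget:
            break                    # budget exhausted: drop older history
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order

# Crude whitespace "tokenizer", purely for illustration
approx_tokens = lambda msg: len(msg.split())
```

Swapping `approx_tokens` for a real tokenizer makes the budget match what the API actually bills.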
Latency Optimization
| Technique | Impact | Complexity |
|---|---|---|
| Streaming responses | Perceived latency drops 50-80% | Low |
| Shorter prompts | Linear reduction in input processing | Low |
| Model downgrade (GPT-4o mini vs GPT-4o) | 2-5x faster | Low |
| Caching common queries | Near-zero latency for cache hits | Medium |
| Prompt caching (Anthropic, OpenAI) | 50-90% reduction on repeated prefixes | Low |
| Self-hosted model with batching | Full control over throughput | High |
Choosing Between API and Self-Hosted
Use an API (OpenAI, Anthropic, Google) when:
- You want zero infrastructure overhead
- Your volume is <100K requests/day
- You need frontier-model quality
- Data privacy requirements allow third-party processing
Self-host (vLLM, TGI, Ollama) when:
- Data cannot leave your infrastructure (healthcare, finance, government)
- Volume exceeds 100K requests/day and cost matters
- You need to fine-tune a model on proprietary data
- Latency requirements demand co-located inference
The Model Landscape in 2026
| Provider | Key Models | Strengths |
|---|---|---|
| OpenAI | GPT-4o, o3, GPT-4o mini | Broad capability, strong tool use, reasoning (o3) |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus | Long context (200K), strong instruction following, safety |
| Google | Gemini 2.0 Pro, Gemini 2.0 Flash | Massive context (2M), multimodal, competitive pricing |
| Meta | Llama 3.1 (8B, 70B, 405B) | Open-weight, fine-tunable, self-hostable |
| Mistral | Mistral Large, Mixtral | European provider, strong multilingual, MoE efficiency |
| DeepSeek | DeepSeek V3 | Open-weight, strong reasoning, very competitive benchmarks |
No single model dominates all tasks. Production systems increasingly use multiple models — a fast model for routing, a mid-tier model for common tasks, and a frontier model for hard queries.
10. Summary and What to Learn Next
With LLM fundamentals in place, the logical next steps are prompt engineering, RAG, agents, and evaluation — each builds directly on what you have learned here.
Key Takeaways
- LLMs are next-token predictors. Every behavior — good and bad — follows from this objective.
- Tokenization is the first bottleneck. It determines cost, speed, and the model’s ability to reason about text.
- Attention is the core innovation. It lets every token in the sequence consider every other token.
- Context windows are not free. Bigger is not always better — cost, latency, and attention quality all degrade with length.
- Temperature controls the randomness-reliability trade-off. Low for accuracy, high for creativity.
- Hallucination is structural, not a bug. The model predicts likely tokens, not verified facts. Design your system accordingly.
- Model selection is a multi-dimensional trade-off. Size, cost, latency, quality, and hosting constraints all matter.
Where to Go Next
With this foundation, every other topic on the site will make more sense:
- Prompt Engineering — Now that you understand tokens and attention, learn how to structure inputs for better outputs.
- RAG Architecture — Understand how retrieval augmented generation solves the knowledge limitation of LLMs.
- AI Agents — Learn how LLMs go from answering questions to executing multi-step tasks.
- LLM Evaluation — Build the testing infrastructure that catches hallucinations and quality regressions.
- Fine-Tuning vs RAG — Learn when and how to adapt a model to your specific domain.
- Python for GenAI — Set up your development environment for building LLM applications.
Related
- Transformer Architecture — Deep dive into attention, layers, and scale
- Tokenization Guide — BPE, SentencePiece, and token counting
- Embeddings Explained — How text becomes vectors for search
- Context Windows — Token budgeting and cost per provider
- Prompt Engineering — Structure inputs for better outputs
Frequently Asked Questions
How do large language models work?
A large language model is a next-token prediction machine. It takes a sequence of tokens as input and predicts the most probable next token, repeating this process to generate complete responses. The transformer architecture powers this through self-attention, which lets each token consider its relationship to every other token.
What is tokenization in LLMs?
Tokenization splits text into subword pieces (tokens) that the model processes. Words are split into common subword units — 'programming' might become 'program' and 'ming'. This is why models struggle with tasks like counting letters in 'strawberry' — they never see individual characters, only subword tokens. Learn more in the tokenization guide.
What is attention in transformers?
Attention is the mechanism that lets each token consider its relationship to every other token in the sequence. When processing the word 'bank' in a sentence about finance versus rivers, attention weights shift to disambiguate the meaning. Self-attention scales quadratically with sequence length, which is why context windows have cost and size limits.
Why do LLMs hallucinate facts?
LLMs hallucinate because they predict plausible-sounding next tokens, not verified facts. The model has no mechanism to distinguish between generating a true statement and a false one — both are just probable token sequences. The training objective is to produce text that statistically resembles the training data, not to be factually accurate.
What are embeddings in LLMs?
Embeddings are dense vectors — lists of numbers — that represent a token's meaning in a high-dimensional space. GPT-3 used 12,288-dimensional embeddings, and frontier models are in a similar range. Tokens with similar meanings end up close together in this space. Learn more in the embeddings guide.
What is the difference between temperature 0 and temperature 1 in LLMs?
Temperature scales the probability distribution before sampling the next token. At temperature 0, the model always picks the highest-probability token (deterministic, greedy). At temperature 1, the original distribution is used, giving lower-probability tokens a chance. Use temperature 0 for extraction and classification; use 0.7 for content generation.
How do context windows affect LLM cost and performance?
Context windows determine how many tokens a model can process in a single request. Larger windows allow processing more text, but cost scales linearly and attention quality can degrade. The 'lost in the middle' problem means models attend more strongly to the beginning and end of long inputs. See the context windows guide for details.
When should I use RAG instead of a large context window?
For corpora under 50K tokens, stuffing the context window is simpler. For corpora above 50K tokens, RAG with targeted retrieval is cheaper, faster, and often more accurate. RAG also wins when your corpus changes frequently or when you need precise citation tracking.
How do I choose between a small model and a frontier model?
Use a small model (7-13B parameters) when the task is narrow, latency must be under 1 second, you have fine-tuning data, or volume makes API costs prohibitive. Use a frontier model (GPT-4o, Claude 3.5 Sonnet) when the task requires general reasoning or strong instruction-following without fine-tuning.
What is the transformer pipeline from input text to output?
The transformer pipeline has three stages. First, raw text is split into tokens and converted into embedding vectors with positional encoding. Second, representations pass through repeated transformer blocks containing multi-head self-attention and feed-forward networks. Finally, the last token's representation is projected onto the vocabulary to produce a next-token probability distribution.