Context Windows & Cost — Token Budgeting Guide (2026)
This context windows guide covers everything you need to manage token budgets in production LLM applications. You will learn exactly how much each provider charges, how to allocate tokens across system prompts, user input, RAG context, and output — and how to compress context when the window runs out before your wallet does.
1. Why Context Windows Matter
Context windows are both a cost lever and a technical constraint: every token costs money, every token beyond the limit is silently dropped, and engineers who ignore this ship applications that are too expensive or too broken to run.
Why Context Windows Are a Cost Problem, Not Just a Technical Limit
128K tokens sounds unlimited. It is not. At GPT-4o’s input pricing of $2.50 per million tokens, filling a 128K context window costs $0.32 per request. Run that 10,000 times a day and you are spending $3,200 daily — $96,000 per month — on input tokens alone.
Context windows define how much text a model can “see” in a single request. Every token in that window costs money. Every token beyond it gets silently dropped. Engineers who treat context windows as infinite end up with applications that are either too expensive to run or too broken to ship.
The skill is not filling the context window. The skill is spending each token where it matters most. That is token budgeting — and it separates production-ready GenAI engineers from tutorial-followers.
For how context windows fit into the broader production architecture, see GenAI System Design.
2. What’s New in 2026
| Development | Impact |
|---|---|
| Gemini 1.5 Pro (2M tokens) | Largest production context window. Process entire codebases or book-length documents in one request |
| Claude 3.5 Sonnet (200K tokens) | Best quality-per-token ratio for long context. Extended thinking uses additional tokens |
| GPT-4o (128K tokens) | Fastest price drops — $2.50/M input, $10/M output. 50% cheaper than 6 months ago |
| Llama 3.1 (128K tokens) | Open-weight model matching proprietary context lengths. Self-host to eliminate per-token cost |
| Context caching (Anthropic, Google) | Cache repeated prefixes at 90% discount. System prompts and few-shot examples nearly free on repeat calls |
| Sparse attention models | Research models handle 10M+ tokens. Not yet production-ready but closing the gap |
3. Real-World Problem Context
A naive context strategy — stuffing entire documents into every request — routinely produces five-figure monthly bills that token budgeting can reduce by 80–90% with no quality loss.
The Bill That Changed Everything
A fintech startup built a document analysis feature using GPT-4o. Each customer uploaded 50-page financial reports. The naive approach: stuff the entire document into the context window, ask the model to summarize.
The math: 50 pages is roughly 25,000 tokens. At $2.50/M input + $10/M output (500 token response), each request costs $0.068. With 5,000 daily users averaging 3 reports each, the monthly bill hit $30,600 — just for this one feature.
After implementing token budgeting (extracting only relevant sections via RAG, compressing context, caching system prompts), they cut costs to $4,100/month — an 87% reduction with no measurable quality loss.
This is not an edge case. Every production LLM application faces this trade-off between context size, quality, and cost.
How Transformers Use Context
Context windows exist because of how transformer attention works. Each token attends to every other token in the window. That means attention computation scales quadratically: doubling the context from 64K to 128K quadruples the attention cost.
Modern models use optimizations like Flash Attention, sliding window attention, and grouped query attention (GQA) to reduce this cost. But the fundamental constraint remains: more context = more compute = higher latency = higher price.
| Context Length | Relative Attention Cost | Typical Latency |
|---|---|---|
| 4K tokens | 1x (baseline) | <1s |
| 16K tokens | 16x | 1-2s |
| 64K tokens | 256x | 3-6s |
| 128K tokens | 1,024x | 5-15s |
| 1M tokens | 62,500x | 30-60s |
These numbers explain why providers charge more for longer contexts. You are paying for GPU time, not just storage.
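The relative-cost column falls straight out of the quadratic scaling. A quick sketch reproduces it, relative to a 4K baseline; treat this as a back-of-envelope model, since it ignores the linear terms and optimizations like Flash Attention mentioned above:

```python
def relative_attention_cost(context_tokens: int, baseline: int = 4_000) -> float:
    """Self-attention compares every token with every other token,
    so compute grows with the square of the context length."""
    return (context_tokens / baseline) ** 2

# Reproduces the relative-cost column: 1x, 16x, 256x, 1,024x, 62,500x
for n in (4_000, 16_000, 64_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {relative_attention_cost(n):,.0f}x")
```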
4. Context Windows Core Concepts
Every LLM request has four token consumers — system prompt, user input, RAG context, and reserved output — and each must have an explicit budget before you write a line of application code.
Token Budgeting Framework
Every LLM request has four token consumers. Your budget must account for all of them:
Total Context Window = System Prompt + User Input + RAG Context + Reserved Output

Here is a practical budget for a 128K-token window:
| Component | Token Allocation | Percentage | Purpose |
|---|---|---|---|
| System prompt | 2,000 | 1.6% | Instructions, persona, format rules |
| Few-shot examples | 3,000 | 2.3% | Quality anchors (2-3 examples) |
| RAG context | 40,000 | 31.3% | Retrieved documents and chunks |
| User input | 15,000 | 11.7% | Current query + conversation history |
| Safety buffer | 4,000 | 3.1% | Overflow protection |
| Reserved output | 64,000 | 50% | Model response generation |
The most common mistake: not reserving enough output tokens. If you fill 120K of a 128K window with input, the model can only generate 8K tokens of output. Long analyses, code generation, and structured responses need 16K-64K output tokens.
Cost Per Provider (March 2026)
| Provider / Model | Context Window | Input ($/M tokens) | Output ($/M tokens) | Cost to Fill Window (input only) |
|---|---|---|---|---|
| GPT-4o | 128K | $2.50 | $10.00 | $0.32 |
| GPT-4o mini | 128K | $0.15 | $0.60 | $0.019 |
| Claude 3.5 Sonnet | 200K | $3.00 | $15.00 | $0.60 |
| Claude 3.5 Haiku | 200K | $0.80 | $4.00 | $0.16 |
| Gemini 1.5 Pro | 2M | $1.25 | $5.00 | $2.50 |
| Gemini 1.5 Flash | 1M | $0.075 | $0.30 | $0.075 |
| Llama 3.1 (self-hosted) | 128K | ~$0.20* | ~$0.20* | $0.026 |
*Self-hosted costs vary by GPU. Estimate based on an A100 at $2/hr processing ~10M tokens/hr.
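The last column is simply window size times input price. A small helper makes the comparison reproducible; the prices are hard-coded from the table above and will drift, so verify them against current provider pricing before relying on the output:

```python
def cost_to_fill(context_tokens: int, input_price_per_m: float) -> float:
    """Input-only cost of a completely full context window, in USD."""
    return context_tokens / 1_000_000 * input_price_per_m

# Prices from the table above (March 2026 list prices)
print(cost_to_fill(128_000, 2.50))    # GPT-4o: 0.32
print(cost_to_fill(200_000, 3.00))    # Claude 3.5 Sonnet: 0.60
print(cost_to_fill(2_000_000, 1.25))  # Gemini 1.5 Pro: 2.50
```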
The “Lost in the Middle” Problem
Models do not treat all positions in the context window equally. Research from Liu et al. (2023) showed that LLMs perform best on information at the beginning and end of the context, but degrade on information in the middle.
In a 128K context window, the accuracy difference is significant:
| Information Position | Retrieval Accuracy |
|---|---|
| First 10% of context | 85-95% |
| Middle 40-60% of context | 55-70% |
| Last 10% of context | 80-90% |
Practical implication: Put your most important information (system instructions, critical context) at the start. Put the user’s current question at the end. Treat the middle as lower-priority space.
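One way to apply that rule mechanically is to assemble every prompt in a fixed order: instructions first, retrieved context in the middle, the current query last. A minimal sketch — the section labels and example strings are illustrative, not a required format:

```python
def assemble_prompt(instructions: str, chunks: list[str], query: str) -> str:
    """Order content to reduce the lost-in-the-middle penalty:
    critical instructions first, supporting context in the middle,
    the user's question last."""
    context = "\n\n".join(chunks)
    return f"{instructions}\n\n## Context\n{context}\n\n## Question\n{query}"

prompt = assemble_prompt(
    "You are a contract analyst. Cite section numbers.",
    ["Section 4.2: ...", "Section 9.1: ..."],
    "Does clause 4.2 conflict with clause 9.1?",
)
```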
5. Technical Deep Dive
Accurate token counting and context compression are the two implementation skills that turn a token budget from a spreadsheet estimate into enforced per-request limits.
Token Counting with tiktoken
Before you can budget tokens, you need to count them. Python’s tiktoken library gives you exact counts for OpenAI models. For Claude, use Anthropic’s tokenizer. For rough estimates, 1 token is approximately 4 characters in English.
```python
import tiktoken


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def budget_context(
    system_prompt: str,
    user_input: str,
    rag_chunks: list[str],
    model: str = "gpt-4o",
    max_context: int = 128_000,
    reserved_output: int = 4_096,
) -> dict:
    """Calculate token budget and trim RAG context if needed."""
    system_tokens = count_tokens(system_prompt, model)
    user_tokens = count_tokens(user_input, model)
    safety_buffer = 500

    available_for_rag = (
        max_context - system_tokens - user_tokens - reserved_output - safety_buffer
    )

    selected_chunks = []
    used_tokens = 0
    for chunk in rag_chunks:
        chunk_tokens = count_tokens(chunk, model)
        if used_tokens + chunk_tokens > available_for_rag:
            break
        selected_chunks.append(chunk)
        used_tokens += chunk_tokens

    return {
        "system_tokens": system_tokens,
        "user_tokens": user_tokens,
        "rag_tokens": used_tokens,
        "rag_chunks_used": len(selected_chunks),
        "rag_chunks_dropped": len(rag_chunks) - len(selected_chunks),
        "reserved_output": reserved_output,
        "total_input": system_tokens + user_tokens + used_tokens,
        "remaining_buffer": available_for_rag - used_tokens + safety_buffer,
    }


# Example usage
budget = budget_context(
    system_prompt="You are a helpful financial analyst...",
    user_input="Summarize the Q3 revenue trends from this report.",
    rag_chunks=["Revenue in Q3 grew by 12%...", "Operating costs..."],
)
print(f"Total input tokens: {budget['total_input']:,}")
print(f"RAG chunks used: {budget['rag_chunks_used']}")
print(f"Chunks dropped: {budget['rag_chunks_dropped']}")
```

Context Compression Techniques
When your content exceeds the budget, you have four strategies:
1. Extractive compression — keep only relevant sentences
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def compress_by_relevance(
    document: str,
    query: str,
    max_sentences: int = 20,
) -> str:
    """Keep only the most relevant sentences to the query."""
    sentences = document.split(". ")
    query_embedding = model.encode(query, convert_to_tensor=True)
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

    scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
    top_indices = scores.argsort(descending=True)[:max_sentences]
    # Preserve original order for coherence
    top_indices = sorted(top_indices.tolist())

    return ". ".join(sentences[i] for i in top_indices)
```

2. Map-reduce summarization — summarize chunks, then summarize summaries
```python
from openai import OpenAI

client = OpenAI()


def map_reduce_summarize(
    chunks: list[str],
    final_query: str,
    model: str = "gpt-4o-mini",
) -> str:
    """Summarize each chunk, then combine summaries."""
    # Map: summarize each chunk independently
    summaries = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Summarize the key facts in 3-5 bullet points."},
                {"role": "user", "content": chunk},
            ],
            max_tokens=300,
        )
        summaries.append(response.choices[0].message.content)

    # Reduce: combine all summaries and answer the question
    combined = "\n\n".join(summaries)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question using only the provided summaries."},
            {"role": "user", "content": f"Summaries:\n{combined}\n\nQuestion: {final_query}"},
        ],
        max_tokens=1_000,
    )
    return response.choices[0].message.content
```

3. Sliding window — process document in overlapping chunks
Use when you need to analyze a document sequentially (e.g., extracting all entities, checking compliance across sections). Each window overlaps with the previous by 10-20% to avoid missing information at boundaries.
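A minimal sketch of that windowing logic, assuming the document is already tokenized into a list of token ids; the 15% default overlap is a midpoint of the 10-20% range above, and the function name is illustrative:

```python
def sliding_windows(
    tokens: list[int],
    window: int = 4_000,
    overlap: float = 0.15,
) -> list[list[int]]:
    """Split a token sequence into overlapping windows so facts that
    straddle a boundary appear intact in at least one window."""
    if window <= 0 or not 0 <= overlap < 1:
        raise ValueError("window must be positive and overlap in [0, 1)")
    step = max(1, round(window * (1 - overlap)))
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return windows

# 10,000 tokens in 4K windows with 15% overlap -> 3 windows,
# each sharing 600 tokens with its neighbor
chunks = sliding_windows(list(range(10_000)), window=4_000, overlap=0.15)
```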
4. Selective retrieval (RAG) — retrieve only what you need
The most cost-effective approach for production. Instead of stuffing the full document into context, embed it, store in a vector database, and retrieve only the 5-10 most relevant chunks per query. See RAG Architecture for the full implementation guide.
6. Comparison and Trade-offs
Short context windows reduce cost and latency but require RAG; long context windows simplify architecture but become expensive at scale.
Short Context vs Long Context
Section titled “Short Context vs Long Context”📊 Visual Explanation
Section titled “📊 Visual Explanation”Short Context vs Long Context — Cost and Performance Trade-offs
- Low cost per request ($0.001-0.01)
- Fast response times (sub-second)
- Forces better prompt engineering
- Cannot process long documents
- Requires chunking and retrieval (RAG)
- Process entire documents at once
- Simpler architecture (no RAG needed for some use cases)
- Better cross-document reasoning
- 10-100x higher cost per request
- Slower response times (3-10s for large contexts)
Decision Framework: Long Context vs RAG
Use this table when choosing your architecture:
| Factor | Choose Long Context | Choose RAG |
|---|---|---|
| Request volume | <1,000/day | >10,000/day |
| Document size | <50 pages | Any size |
| Latency requirement | Flexible (5-30s OK) | Strict (<2s) |
| Accuracy need | Cross-document reasoning | Single-fact retrieval |
| Budget | Generous ($10K+/mo) | Tight (<$2K/mo) |
| Architecture complexity | Minimal (no vector DB) | Higher (vector DB + embeddings) |
| Update frequency | Documents change rarely | Documents change daily |
The hybrid approach: Use RAG to retrieve relevant chunks, then pass them into a moderately sized context (16K-32K). You get the cost benefits of RAG with enough context for cross-reference reasoning. This is what most production systems use.
7. Practical Implementation
Chat history, context monitoring, and cost estimation are the three recurring implementation patterns every production LLM application must solve.
Conversation History Management
Chat applications face a unique context budget challenge: every message in the conversation history eats into the window. A 50-message conversation can consume 10K-20K tokens before you add any RAG context.
```python
def manage_conversation_history(
    messages: list[dict],
    max_history_tokens: int = 8_000,
    model: str = "gpt-4o",
) -> list[dict]:
    """Keep recent messages within token budget.

    Strategy: always keep system message + last 2 messages.
    Fill remaining budget with older messages, newest first.
    """
    if not messages:
        return messages

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    # Always include last 2 exchanges
    required = non_system[-2:] if len(non_system) >= 2 else non_system[:]
    optional = non_system[:-2] if len(non_system) > 2 else []

    budget = max_history_tokens
    for msg in system_msgs + required:
        budget -= count_tokens(msg["content"], model)

    # Add older messages from most recent, stop when budget runs out
    included_optional = []
    for msg in reversed(optional):
        msg_tokens = count_tokens(msg["content"], model)
        if msg_tokens > budget:
            break
        included_optional.insert(0, msg)
        budget -= msg_tokens

    return system_msgs + included_optional + required
```

Context Window Monitoring in Production
Track these metrics to catch budget overflows before they hit your users:
```python
from dataclasses import dataclass


@dataclass
class ContextMetrics:
    request_id: str
    total_input_tokens: int
    output_tokens: int
    context_utilization: float  # input_tokens / max_context
    rag_chunks_used: int
    rag_chunks_dropped: int
    latency_ms: float
    estimated_cost_usd: float


def log_context_metrics(
    request_id: str,
    budget: dict,
    output_tokens: int,
    latency_ms: float,
    max_context: int = 128_000,
    input_price_per_m: float = 2.50,
    output_price_per_m: float = 10.00,
) -> ContextMetrics:
    """Log token usage metrics for monitoring dashboards."""
    input_cost = (budget["total_input"] / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m

    metrics = ContextMetrics(
        request_id=request_id,
        total_input_tokens=budget["total_input"],
        output_tokens=output_tokens,
        context_utilization=budget["total_input"] / max_context,
        rag_chunks_used=budget["rag_chunks_used"],
        rag_chunks_dropped=budget["rag_chunks_dropped"],
        latency_ms=latency_ms,
        estimated_cost_usd=round(input_cost + output_cost, 6),
    )

    # Alert if context utilization exceeds 80%
    if metrics.context_utilization > 0.8:
        print(f"WARNING: Context {metrics.context_utilization:.0%} full "
              f"for request {request_id}")

    # Alert if chunks are being dropped
    if metrics.rag_chunks_dropped > 0:
        print(f"INFO: Dropped {metrics.rag_chunks_dropped} RAG chunks "
              f"for request {request_id}")

    return metrics
```

Cost Estimation Calculator
```python
def estimate_monthly_cost(
    avg_input_tokens: int,
    avg_output_tokens: int,
    requests_per_day: int,
    input_price_per_m: float = 2.50,
    output_price_per_m: float = 10.00,
) -> dict:
    """Estimate monthly LLM costs based on usage patterns."""
    daily_input_cost = (avg_input_tokens * requests_per_day / 1_000_000) * input_price_per_m
    daily_output_cost = (avg_output_tokens * requests_per_day / 1_000_000) * output_price_per_m
    daily_total = daily_input_cost + daily_output_cost
    monthly_total = daily_total * 30

    return {
        "daily_input_cost": round(daily_input_cost, 2),
        "daily_output_cost": round(daily_output_cost, 2),
        "daily_total": round(daily_total, 2),
        "monthly_total": round(monthly_total, 2),
        "cost_per_request": round(daily_total / requests_per_day, 6),
    }


# Example: Customer support bot
support_bot = estimate_monthly_cost(
    avg_input_tokens=4_000,   # system + history + RAG
    avg_output_tokens=500,    # short responses
    requests_per_day=10_000,
)
print(f"Support bot: ${support_bot['monthly_total']:,.0f}/month")
# → Support bot: $4,500/month

# Example: Document analysis pipeline
doc_analysis = estimate_monthly_cost(
    avg_input_tokens=50_000,  # full document in context
    avg_output_tokens=2_000,  # detailed analysis
    requests_per_day=500,
)
print(f"Doc analysis: ${doc_analysis['monthly_total']:,.0f}/month")
# → Doc analysis: $2,175/month
```

8. Context Windows Interview Questions
Context window questions test whether you can calculate costs on the spot and articulate when to use long context versus RAG — skills that signal production experience.
What Interviewers Expect
Context window questions test whether you understand the cost and performance implications of LLM architecture decisions. Senior candidates should be able to calculate costs on the spot and articulate when to use long context vs RAG.
Strong vs Weak Answer Patterns
Q: “Your LLM application processes 100-page legal documents. How would you design the context management?”
Weak: “I would use a model with a large context window and put the whole document in.”
Strong: “First, I would calculate the token count — 100 pages is roughly 50,000 tokens. With GPT-4o at $2.50/M input, that is $0.125 per request just for input. If we expect 1,000 requests per day, that is $3,750/month on input alone. I would implement a hybrid approach: use an embedding model to chunk and index the document, retrieve the 10 most relevant chunks per query using RAG, and pass those into a 16K context window. This drops the average input to about 8,000 tokens — a 6x cost reduction. For questions that require cross-section reasoning, like ‘does clause 4.2 conflict with clause 9.1,’ I would fall back to the full context approach but use Gemini 1.5 Flash at $0.075/M tokens to keep costs manageable.”
Q: “How do you handle the ‘lost in the middle’ problem?”
Weak: “We put important information at the beginning.”
Strong: “The Liu et al. research showed that models perform 20-30% worse on information placed in the middle of long contexts. I handle this three ways. First, structural ordering — system instructions and critical context go at the beginning, the user’s current query goes at the end, and supporting context fills the middle. Second, I use a relevance-scored retrieval system that ranks chunks by importance and places the highest-scored chunks at position boundaries. Third, for critical applications, I implement a verification step where the model is asked to explicitly cite which sections it used — if it cannot cite middle-positioned content, I know I need to restructure.”
Common Interview Questions
- Calculate the monthly cost of running a customer support chatbot with 50,000 daily conversations at an average of 8 messages each
- How would you reduce token costs by 80% without sacrificing response quality?
- Explain the quadratic attention cost and how it affects pricing
- Compare the cost-effectiveness of GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro for a document analysis use case
- Design a conversation history management system that stays within a 16K token budget
- When would you choose long context over RAG? Give three specific scenarios with cost justifications
9. Context Windows in Production
Three patterns — graceful degradation, tiered model routing, and context caching — can reduce production token costs by 40–90% without changing application behavior for end users.
Production Token Budgeting Patterns
Pattern 1: Fixed Budget with Graceful Degradation
Set a hard token budget per request. When content exceeds it, degrade gracefully:
- First, drop the oldest conversation history messages
- Then, reduce RAG chunks from 10 to 5
- Then, summarize remaining RAG chunks before injection
- Finally, truncate to fit (last resort — log a warning)
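A sketch of the first two steps of that cascade, using a rough 4-characters-per-token heuristic in place of a real tokenizer; the summarize and truncate steps are left as comments, and all names here are illustrative:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per English token
    return max(1, len(text) // 4)


def degrade_to_budget(
    history: list[str],
    rag_chunks: list[str],
    budget_tokens: int,
) -> tuple[list[str], list[str]]:
    """Apply the degradation cascade: drop oldest history first,
    then trim RAG chunks down toward 5, truncating only as a last resort."""
    def total() -> int:
        return sum(approx_tokens(t) for t in history + rag_chunks)

    # Step 1: drop the oldest conversation history messages
    while history and total() > budget_tokens:
        history = history[1:]
    # Step 2: reduce RAG chunks (lowest-ranked last, so trim from the end)
    while len(rag_chunks) > 5 and total() > budget_tokens:
        rag_chunks = rag_chunks[:-1]
    # Steps 3 and 4 (summarize remaining chunks, then hard-truncate)
    # would go here; log a warning if truncation is ever reached.
    if total() > budget_tokens:
        print("WARNING: context still over budget, truncation required")
    return history, rag_chunks


history = [f"message {i} " * 50 for i in range(10)]
rag = ["chunk text " * 50 for _ in range(10)]
kept_history, kept_rag = degrade_to_budget(history, rag, budget_tokens=1_200)
```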
Pattern 2: Tiered Models by Context Need
Route requests to different models based on context requirements:
- <4K tokens: GPT-4o mini ($0.15/M) — fast and cheap
- 4K-32K tokens: GPT-4o ($2.50/M) — best quality-to-price
- 32K-200K tokens: Claude 3.5 Sonnet ($3.00/M) — strong long-context performance
- 200K+ tokens: Gemini 1.5 Pro ($1.25/M) — only option at this scale
This alone can cut costs 40-60% compared to routing everything through one model.
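The tiers above translate directly into a dispatch function. The model identifier strings below are illustrative placeholders; check your provider's current model names and the exact cutoffs before using them:

```python
def pick_model(input_tokens: int) -> str:
    """Route by context size: the cheapest model that fits the request."""
    if input_tokens < 4_000:
        return "gpt-4o-mini"
    if input_tokens < 32_000:
        return "gpt-4o"
    if input_tokens < 200_000:
        return "claude-3-5-sonnet-latest"
    return "gemini-1.5-pro"

print(pick_model(2_500))    # gpt-4o-mini
print(pick_model(150_000))  # claude-3-5-sonnet-latest
```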
Pattern 3: Context Caching for Repeated Prefixes
Anthropic and Google both offer prompt caching. If your system prompt + few-shot examples are the same across requests (they usually are), cache them:
- Anthropic cache pricing: 90% discount on cached tokens. A 3,000-token system prompt reused 10,000 times/day costs ~$90/day at $3.00/M input pricing; caching saves roughly $81/day (~$2,400/month)
- Google cache pricing: Similar discount for prefixes. Gemini 1.5 Pro with cached prefix drops effective input cost to ~$0.31/M tokens
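The savings arithmetic is worth making explicit. This sketch assumes a flat 90% discount on every cached input token and ignores the one-time cache-write premium that real cache pricing adds:

```python
def daily_cache_savings(
    prefix_tokens: int,
    calls_per_day: int,
    input_price_per_m: float,
    cache_discount: float = 0.90,
) -> float:
    """USD saved per day by caching a repeated prompt prefix."""
    full_cost = prefix_tokens * calls_per_day / 1_000_000 * input_price_per_m
    return full_cost * cache_discount

# 3,000-token system prompt, 10,000 calls/day, Claude 3.5 Sonnet input price
print(round(daily_cache_savings(3_000, 10_000, 3.00), 2))  # 81.0
```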
Cost Control Checklist
Before deploying any LLM-powered feature to production, verify:
- Token budget defined for each component (system, user, RAG, output)
- Maximum output tokens capped (max_tokens parameter set)
- Conversation history has a sliding window with token limit
- RAG chunks are ranked by relevance and trimmed to fit budget
- Context utilization monitored with alerts at 80% and 95%
- Per-request cost logged and aggregated in billing dashboard
- Context caching enabled for static prefixes (system prompt, examples)
- Fallback model configured for overflow scenarios
- Monthly cost projection reviewed before launch
- Rate limiting enforced to prevent cost spikes from abuse
Scaling Cost Projections
| Growth Scenario | 1K req/day | 10K req/day | 100K req/day | 1M req/day |
|---|---|---|---|---|
| Small context (4K in, 500 out) | $450/mo | $4,500/mo | $45,000/mo | $450,000/mo |
| Medium context (16K in, 2K out) | $1,800/mo | $18,000/mo | $180,000/mo | $1,800,000/mo |
| Large context (64K in, 4K out) | $6,000/mo | $60,000/mo | $600,000/mo | $6,000,000/mo |
| Full context (128K in, 8K out) | $12,000/mo | $120,000/mo | $1,200,000/mo | $12,000,000/mo |
Based on GPT-4o pricing ($2.50/M input, $10/M output). Multiply by 0.06 for GPT-4o mini.
10. Summary and Key Takeaways
Use these reference tables and checklists to make context window decisions quickly: which provider, which strategy, and what to verify before shipping.
The Decision in 30 Seconds
| Question | Answer |
|---|---|
| How much does filling a 128K window cost? | $0.32 (GPT-4o), $0.60 (Claude 3.5 Sonnet), $0.075 (Gemini Flash) |
| Should I use long context or RAG? | RAG for >10K req/day or tight budgets. Long context for deep analysis tasks |
| What is the biggest cost mistake? | Not setting max_tokens on output. A runaway response can 10x your cost |
| How do I cut costs without quality loss? | Cache static prefixes, tier models by context need, use RAG to shrink input |
| Does position in context matter? | Yes. Start and end positions get 85-95% accuracy. Middle drops to 55-70% |
Official Documentation
- OpenAI Tokenizer — Interactive token counter for GPT models
- Anthropic Token Counting — Claude tokenizer and context window docs
- Google AI Context Caching — Gemini context caching guide
- tiktoken (Python) — Fast BPE tokenizer for OpenAI models
- Lost in the Middle (Liu et al.) — Research on positional bias in long contexts
Related
- RAG Architecture — Build retrieval pipelines that keep context lean and costs low
- Prompt Engineering — Write system prompts that maximize quality per token
- LLM Evaluation — Measure quality across different context sizes and strategies
- Fine-Tuning vs RAG — When to bake knowledge into the model vs retrieving at runtime
- GenAI System Design — Full production architecture including token budget management
Last updated: March 2026. Token pricing changes frequently; verify current rates against provider documentation before making architecture decisions.
Frequently Asked Questions
How much does it cost to fill a 128K context window?
At GPT-4o pricing of $2.50 per million input tokens, filling a 128K context window costs $0.32 per request. At 10,000 requests per day, that is $3,200 daily or $96,000 per month on input tokens alone. Token budgeting — spending each token where it matters most — is what separates production-ready GenAI engineers from tutorial-followers.
What are the largest context windows available in 2026?
Gemini 1.5 Pro offers the largest at 2 million tokens, capable of processing entire codebases or book-length documents. Claude 3.5 Sonnet provides 200K tokens with the best quality-per-token ratio for long context. GPT-4o supports 128K tokens with the fastest price drops. Llama 3.1 matches 128K tokens as an open-weight model you can self-host to eliminate per-token costs.
How do you reduce LLM context costs without losing quality?
Use RAG to extract only relevant sections instead of stuffing entire documents. Apply context compression techniques to condense information. Use context caching from Anthropic and Google to cache repeated prefixes at 90% discount. Allocate token budgets across system prompt, user input, RAG context, and output. One fintech startup cut costs from $30,600 to $4,100 monthly — an 87% reduction — with no measurable quality loss.
Why do context windows have limits?
Context windows are limited because of how transformer attention works. Each token attends to every other token in the window, so attention computation scales quadratically — doubling context from 64K to 128K quadruples the attention cost. This is why larger context windows cost more per token and why techniques like sparse attention are being researched to handle 10M+ tokens.
What is the token budgeting framework for LLM requests?
Every LLM request has four token consumers: system prompt, user input, RAG context, and reserved output. For a 128K-token window, a practical budget allocates roughly 2,000 tokens for the system prompt, 3,000 for few-shot examples, 40,000 for RAG context, 15,000 for user input, 4,000 for safety buffer, and 64,000 (50%) reserved for output. The most common mistake is not reserving enough output tokens.
What is the lost in the middle problem in LLMs?
Research from Liu et al. (2023) showed that LLMs perform best on information at the beginning and end of the context window but degrade on information in the middle. Retrieval accuracy is 85-95% for the first 10% of context, drops to 55-70% in the middle, and recovers to 80-90% in the last 10%. Put your most important information at the start and the user query at the end.
When should I use long context windows versus RAG?
Use RAG for high-volume applications with more than 10,000 requests per day or tight budgets under $2K per month. Use long context for deep analysis tasks like legal review or code understanding where cross-document reasoning matters. The hybrid approach — retrieving relevant chunks via RAG and passing them into a 16K-32K context — is what most production systems use.
How does conversation history affect context window costs?
Every message in the conversation history consumes context window tokens. A 50-message conversation can use 10K-20K tokens before adding any RAG context. Manage this with a sliding window that keeps only the last N turns, summarization of older turns into compressed summaries, or selective retention that keeps high-information turns and drops filler and acknowledgements.
What is context caching and how much does it save?
Context caching from Anthropic and Google lets you cache repeated prefixes like system prompts at roughly 90% discount on token costs. A 3,000-token system prompt reused 10,000 times per day saves roughly $80 per day, about $2,400 per month, at Claude 3.5 Sonnet input pricing. Google offers similar discounts for Gemini 1.5 Pro, dropping effective input cost to approximately $0.31 per million tokens for cached prefixes.
How do I monitor context window usage in production?
Track context utilization (input tokens divided by max context), RAG chunks used versus dropped, per-request cost, and latency. Set alerts at 80% and 95% context utilization. Log per-request costs and aggregate them in a billing dashboard. Monitor chunks dropped per request to detect when your RAG retrieval returns more content than the budget allows.