Context Windows & Cost — Token Budgeting Guide (2026)
This context windows guide covers everything you need to manage token budgets in production LLM applications. You will learn exactly how much each provider charges, how to allocate tokens across system prompts, user input, RAG context, and output — and how to compress context when the window runs out before your wallet does.
1. Why Context Windows Matter
Context windows are both a cost lever and a technical constraint: every token costs money, every token beyond the limit is silently dropped, and engineers who ignore this ship applications that are too expensive or too broken to run.
Why Context Windows Are a Cost Problem, Not Just a Technical Limit
128K tokens sounds unlimited. It is not. At GPT-4o’s input pricing of $2.50 per million tokens, filling a 128K context window costs $0.32 per request. Run that 10,000 times a day and you are spending $3,200 daily — $96,000 per month — on input tokens alone.
Context windows define how much text a model can “see” in a single request. Every token in that window costs money. Every token beyond it gets silently dropped. Engineers who treat context windows as infinite end up with applications that are either too expensive to run or too broken to ship.
The skill is not filling the context window. The skill is spending each token where it matters most. That is token budgeting — and it separates production-ready GenAI engineers from tutorial-followers.
For how context windows fit into the broader production architecture, see GenAI System Design.
2. What’s New in 2026
| Development | Impact |
|---|---|
| Gemini 1.5 Pro (2M tokens) | Largest production context window. Process entire codebases or book-length documents in one request |
| Claude 3.5 Sonnet (200K tokens) | Best quality-per-token ratio for long context. Extended thinking uses additional tokens |
| GPT-4o (128K tokens) | Fastest price drops — $2.50/M input, $10/M output. 50% cheaper than 6 months ago |
| Llama 3.1 (128K tokens) | Open-weight model matching proprietary context lengths. Self-host to eliminate per-token cost |
| Context caching (Anthropic, Google) | Cache repeated prefixes at 90% discount. System prompts and few-shot examples nearly free on repeat calls |
| Sparse attention models | Research models handle 10M+ tokens. Not yet production-ready but closing the gap |
3. Real-World Problem Context
A naive context strategy — stuffing entire documents into every request — routinely produces five-figure monthly bills that token budgeting can reduce by 80–90% with no quality loss.
The Bill That Changed Everything
A fintech startup built a document analysis feature using GPT-4o. Each customer uploaded 50-page financial reports. The naive approach: stuff the entire document into the context window, ask the model to summarize.
The math: 50 pages is roughly 25,000 tokens. At $2.50/M input + $10/M output (500 token response), each request costs $0.068. With 5,000 daily users averaging 3 reports each, the monthly bill hit $30,600 — just for this one feature.
After implementing token budgeting (extracting only relevant sections via RAG, compressing context, caching system prompts), they cut costs to $4,100/month — an 87% reduction with no measurable quality loss.
This is not an edge case. Every production LLM application faces this trade-off between context size, quality, and cost.
How Transformers Use Context
Context windows exist because of how transformer attention works. Each token attends to every other token in the window. That means attention computation scales quadratically: doubling the context from 64K to 128K quadruples the attention cost.
Modern models use optimizations like Flash Attention, sliding window attention, and grouped query attention (GQA) to reduce this cost. But the fundamental constraint remains: more context = more compute = higher latency = higher price.
| Context Length | Relative Attention Cost | Typical Latency |
|---|---|---|
| 4K tokens | 1x (baseline) | <1s |
| 16K tokens | 16x | 1-2s |
| 64K tokens | 256x | 3-6s |
| 128K tokens | 1,024x | 5-15s |
| 1M tokens | 62,500x | 30-60s |
These numbers explain why providers charge more for longer contexts. You are paying for GPU time, not just storage.
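The relative-cost column falls straight out of the quadratic scaling. A quick sketch reproduces it, relative to a 4K baseline; treat this as a back-of-envelope model, since it ignores the linear terms and optimizations like Flash Attention mentioned above:

```python
def relative_attention_cost(context_tokens: int, baseline: int = 4_000) -> float:
    """Self-attention compares every token with every other token,
    so compute grows with the square of the context length."""
    return (context_tokens / baseline) ** 2

# Reproduces the relative-cost column: 1x, 16x, 256x, 1,024x, 62,500x
for n in (4_000, 16_000, 64_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {relative_attention_cost(n):,.0f}x")
```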
4. Context Windows Core Concepts
Every LLM request has four token consumers — system prompt, user input, RAG context, and reserved output — and each must have an explicit budget before you write a line of application code.
Token Budgeting Framework
Every LLM request has four token consumers. Your budget must account for all of them:
Total Context Window = System Prompt + User Input + RAG Context + Reserved Output

Here is a practical budget for a 128K-token window:
| Component | Token Allocation | Percentage | Purpose |
|---|---|---|---|
| System prompt | 2,000 | 1.6% | Instructions, persona, format rules |
| Few-shot examples | 3,000 | 2.3% | Quality anchors (2-3 examples) |
| RAG context | 40,000 | 31.3% | Retrieved documents and chunks |
| User input | 15,000 | 11.7% | Current query + conversation history |
| Safety buffer | 4,000 | 3.1% | Overflow protection |
| Reserved output | 64,000 | 50% | Model response generation |
The most common mistake: not reserving enough output tokens. If you fill 120K of a 128K window with input, the model can only generate 8K tokens of output. Long analyses, code generation, and structured responses need 16K-64K output tokens.
Cost Per Provider (March 2026)
| Provider / Model | Context Window | Input ($/M tokens) | Output ($/M tokens) | Cost to Fill Window (input only) |
|---|---|---|---|---|
| GPT-4o | 128K | $2.50 | $10.00 | $0.32 |
| GPT-4o mini | 128K | $0.15 | $0.60 | $0.019 |
| Claude 3.5 Sonnet | 200K | $3.00 | $15.00 | $0.60 |
| Claude 3.5 Haiku | 200K | $0.80 | $4.00 | $0.16 |
| Gemini 1.5 Pro | 2M | $1.25 | $5.00 | $2.50 |
| Gemini 1.5 Flash | 1M | $0.075 | $0.30 | $0.075 |
| Llama 3.1 (self-hosted) | 128K | ~$0.20* | ~$0.20* | $0.026 |
*Self-hosted costs vary by GPU. Estimate based on an A100 at $2/hr processing ~10M tokens/hr.
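The last column is simply window size times input price. A small helper makes the comparison reproducible; the prices are hard-coded from the table above and will drift, so verify them against current provider pricing before relying on the output:

```python
def cost_to_fill(context_tokens: int, input_price_per_m: float) -> float:
    """Input-only cost of a completely full context window, in USD."""
    return context_tokens / 1_000_000 * input_price_per_m

# Prices from the table above (March 2026 list prices)
print(cost_to_fill(128_000, 2.50))    # GPT-4o: 0.32
print(cost_to_fill(200_000, 3.00))    # Claude 3.5 Sonnet: 0.60
print(cost_to_fill(2_000_000, 1.25))  # Gemini 1.5 Pro: 2.50
```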
The “Lost in the Middle” Problem
Models do not treat all positions in the context window equally. Research from Liu et al. (2023) showed that LLMs perform best on information at the beginning and end of the context, but degrade on information in the middle.
In a 128K context window, the accuracy difference is significant:
| Information Position | Retrieval Accuracy |
|---|---|
| First 10% of context | 85-95% |
| Middle 40-60% of context | 55-70% |
| Last 10% of context | 80-90% |
Practical implication: Put your most important information (system instructions, critical context) at the start. Put the user’s current question at the end. Treat the middle as lower-priority space.
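One way to apply that rule mechanically is to assemble every prompt in a fixed order: instructions first, retrieved context in the middle, the current query last. A minimal sketch — the section labels and example strings are illustrative, not a required format:

```python
def assemble_prompt(instructions: str, chunks: list[str], query: str) -> str:
    """Order content to reduce the lost-in-the-middle penalty:
    critical instructions first, supporting context in the middle,
    the user's question last."""
    context = "\n\n".join(chunks)
    return f"{instructions}\n\n## Context\n{context}\n\n## Question\n{query}"

prompt = assemble_prompt(
    "You are a contract analyst. Cite section numbers.",
    ["Section 4.2: ...", "Section 9.1: ..."],
    "Does clause 4.2 conflict with clause 9.1?",
)
```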
5. Technical Deep Dive
Accurate token counting and context compression are the two implementation skills that turn a token budget from a spreadsheet estimate into enforced per-request limits.
Token Counting with tiktoken
Before you can budget tokens, you need to count them. Python’s tiktoken library gives you exact counts for OpenAI models. For Claude, use Anthropic’s tokenizer. For rough estimates, 1 token is approximately 4 characters in English.
```python
import tiktoken


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def budget_context(
    system_prompt: str,
    user_input: str,
    rag_chunks: list[str],
    model: str = "gpt-4o",
    max_context: int = 128_000,
    reserved_output: int = 4_096,
) -> dict:
    """Calculate token budget and trim RAG context if needed."""
    system_tokens = count_tokens(system_prompt, model)
    user_tokens = count_tokens(user_input, model)
    safety_buffer = 500

    available_for_rag = (
        max_context - system_tokens - user_tokens - reserved_output - safety_buffer
    )

    selected_chunks = []
    used_tokens = 0
    for chunk in rag_chunks:
        chunk_tokens = count_tokens(chunk, model)
        if used_tokens + chunk_tokens > available_for_rag:
            break
        selected_chunks.append(chunk)
        used_tokens += chunk_tokens

    return {
        "system_tokens": system_tokens,
        "user_tokens": user_tokens,
        "rag_tokens": used_tokens,
        "rag_chunks_used": len(selected_chunks),
        "rag_chunks_dropped": len(rag_chunks) - len(selected_chunks),
        "reserved_output": reserved_output,
        "total_input": system_tokens + user_tokens + used_tokens,
        "remaining_buffer": available_for_rag - used_tokens + safety_buffer,
    }


# Example usage
budget = budget_context(
    system_prompt="You are a helpful financial analyst...",
    user_input="Summarize the Q3 revenue trends from this report.",
    rag_chunks=["Revenue in Q3 grew by 12%...", "Operating costs..."],
)
print(f"Total input tokens: {budget['total_input']:,}")
print(f"RAG chunks used: {budget['rag_chunks_used']}")
print(f"Chunks dropped: {budget['rag_chunks_dropped']}")
```

Context Compression Techniques
When your content exceeds the budget, you have four strategies:
1. Extractive compression — keep only relevant sentences
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def compress_by_relevance(
    document: str,
    query: str,
    max_sentences: int = 20,
) -> str:
    """Keep only the most relevant sentences to the query."""
    sentences = document.split(". ")
    query_embedding = model.encode(query, convert_to_tensor=True)
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

    scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
    top_indices = scores.argsort(descending=True)[:max_sentences]
    # Preserve original order for coherence
    top_indices = sorted(top_indices.tolist())

    return ". ".join(sentences[i] for i in top_indices)
```

2. Map-reduce summarization — summarize chunks, then summarize summaries
```python
from openai import OpenAI

client = OpenAI()


def map_reduce_summarize(
    chunks: list[str],
    final_query: str,
    model: str = "gpt-4o-mini",
) -> str:
    """Summarize each chunk, then combine summaries."""
    # Map: summarize each chunk independently
    summaries = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Summarize the key facts in 3-5 bullet points."},
                {"role": "user", "content": chunk},
            ],
            max_tokens=300,
        )
        summaries.append(response.choices[0].message.content)

    # Reduce: combine all summaries and answer the question
    combined = "\n\n".join(summaries)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question using only the provided summaries."},
            {"role": "user", "content": f"Summaries:\n{combined}\n\nQuestion: {final_query}"},
        ],
        max_tokens=1_000,
    )
    return response.choices[0].message.content
```

3. Sliding window — process document in overlapping chunks
Use when you need to analyze a document sequentially (e.g., extracting all entities, checking compliance across sections). Each window overlaps with the previous by 10-20% to avoid missing information at boundaries.
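A minimal sketch of that windowing logic, assuming the document is already tokenized into a list of token ids; the 15% default overlap is a midpoint of the 10-20% range above, and the function name is illustrative:

```python
def sliding_windows(
    tokens: list[int],
    window: int = 4_000,
    overlap: float = 0.15,
) -> list[list[int]]:
    """Split a token sequence into overlapping windows so facts that
    straddle a boundary appear intact in at least one window."""
    if window <= 0 or not 0 <= overlap < 1:
        raise ValueError("window must be positive and overlap in [0, 1)")
    step = max(1, round(window * (1 - overlap)))
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return windows

# 10,000 tokens in 4K windows with 15% overlap -> 3 windows,
# each sharing 600 tokens with its neighbor
chunks = sliding_windows(list(range(10_000)), window=4_000, overlap=0.15)
```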
4. Selective retrieval (RAG) — retrieve only what you need
The most cost-effective approach for production. Instead of stuffing the full document into context, embed it, store in a vector database, and retrieve only the 5-10 most relevant chunks per query. See RAG Architecture for the full implementation guide.
6. Comparison and Trade-offs
Short context windows reduce cost and latency but require RAG; long context windows simplify architecture but become expensive at scale.
Short Context vs Long Context
Section titled “Short Context vs Long Context”📊 Visual Explanation
Section titled “📊 Visual Explanation”Short Context vs Long Context — Cost and Performance Trade-offs
- Low cost per request ($0.001-0.01)
- Fast response times (sub-second)
- Forces better prompt engineering
- Cannot process long documents
- Requires chunking and retrieval (RAG)
- Process entire documents at once
- Simpler architecture (no RAG needed for some use cases)
- Better cross-document reasoning
- 10-100x higher cost per request
- Slower response times (3-10s for large contexts)
Decision Framework: Long Context vs RAG
Use this table when choosing your architecture:
| Factor | Choose Long Context | Choose RAG |
|---|---|---|
| Request volume | <1,000/day | >10,000/day |
| Document size | <50 pages | Any size |
| Latency requirement | Flexible (5-30s OK) | Strict (<2s) |
| Accuracy need | Cross-document reasoning | Single-fact retrieval |
| Budget | Generous ($10K+/mo) | Tight (<$2K/mo) |
| Architecture complexity | Minimal (no vector DB) | Higher (vector DB + embeddings) |
| Update frequency | Documents change rarely | Documents change daily |
The hybrid approach: Use RAG to retrieve relevant chunks, then pass them into a moderately sized context (16K-32K). You get the cost benefits of RAG with enough context for cross-reference reasoning. This is what most production systems use.
7. Practical Implementation
Chat history, context monitoring, and cost estimation are the three recurring implementation patterns every production LLM application must solve.
Conversation History Management
Chat applications face a unique context budget challenge: every message in the conversation history eats into the window. A 50-message conversation can consume 10K-20K tokens before you add any RAG context.
```python
def manage_conversation_history(
    messages: list[dict],
    max_history_tokens: int = 8_000,
    model: str = "gpt-4o",
) -> list[dict]:
    """Keep recent messages within token budget.

    Strategy: always keep system message + last 2 messages.
    Fill remaining budget with older messages, newest first.
    """
    if not messages:
        return messages

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    # Always include last 2 exchanges
    required = non_system[-2:] if len(non_system) >= 2 else non_system[:]
    optional = non_system[:-2] if len(non_system) > 2 else []

    budget = max_history_tokens
    for msg in system_msgs + required:
        budget -= count_tokens(msg["content"], model)

    # Add older messages from most recent, stop when budget runs out
    included_optional = []
    for msg in reversed(optional):
        msg_tokens = count_tokens(msg["content"], model)
        if msg_tokens > budget:
            break
        included_optional.insert(0, msg)
        budget -= msg_tokens

    return system_msgs + included_optional + required
```

Context Window Monitoring in Production
Track these metrics to catch budget overflows before they hit your users:
```python
from dataclasses import dataclass


@dataclass
class ContextMetrics:
    request_id: str
    total_input_tokens: int
    output_tokens: int
    context_utilization: float  # input_tokens / max_context
    rag_chunks_used: int
    rag_chunks_dropped: int
    latency_ms: float
    estimated_cost_usd: float


def log_context_metrics(
    request_id: str,
    budget: dict,
    output_tokens: int,
    latency_ms: float,
    max_context: int = 128_000,
    input_price_per_m: float = 2.50,
    output_price_per_m: float = 10.00,
) -> ContextMetrics:
    """Log token usage metrics for monitoring dashboards."""
    input_cost = (budget["total_input"] / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m

    metrics = ContextMetrics(
        request_id=request_id,
        total_input_tokens=budget["total_input"],
        output_tokens=output_tokens,
        context_utilization=budget["total_input"] / max_context,
        rag_chunks_used=budget["rag_chunks_used"],
        rag_chunks_dropped=budget["rag_chunks_dropped"],
        latency_ms=latency_ms,
        estimated_cost_usd=round(input_cost + output_cost, 6),
    )

    # Alert if context utilization exceeds 80%
    if metrics.context_utilization > 0.8:
        print(f"WARNING: Context {metrics.context_utilization:.0%} full "
              f"for request {request_id}")

    # Alert if chunks are being dropped
    if metrics.rag_chunks_dropped > 0:
        print(f"INFO: Dropped {metrics.rag_chunks_dropped} RAG chunks "
              f"for request {request_id}")

    return metrics
```

Cost Estimation Calculator
```python
def estimate_monthly_cost(
    avg_input_tokens: int,
    avg_output_tokens: int,
    requests_per_day: int,
    input_price_per_m: float = 2.50,
    output_price_per_m: float = 10.00,
) -> dict:
    """Estimate monthly LLM costs based on usage patterns."""
    daily_input_cost = (avg_input_tokens * requests_per_day / 1_000_000) * input_price_per_m
    daily_output_cost = (avg_output_tokens * requests_per_day / 1_000_000) * output_price_per_m
    daily_total = daily_input_cost + daily_output_cost
    monthly_total = daily_total * 30

    return {
        "daily_input_cost": round(daily_input_cost, 2),
        "daily_output_cost": round(daily_output_cost, 2),
        "daily_total": round(daily_total, 2),
        "monthly_total": round(monthly_total, 2),
        "cost_per_request": round(daily_total / requests_per_day, 6),
    }


# Example: Customer support bot
support_bot = estimate_monthly_cost(
    avg_input_tokens=4_000,   # system + history + RAG
    avg_output_tokens=500,    # short responses
    requests_per_day=10_000,
)
print(f"Support bot: ${support_bot['monthly_total']:,.0f}/month")
# → Support bot: $4,500/month

# Example: Document analysis pipeline
doc_analysis = estimate_monthly_cost(
    avg_input_tokens=50_000,  # full document in context
    avg_output_tokens=2_000,  # detailed analysis
    requests_per_day=500,
)
print(f"Doc analysis: ${doc_analysis['monthly_total']:,.0f}/month")
# → Doc analysis: $2,175/month
```

8. Context Windows Interview Questions
Context window questions test whether you can calculate costs on the spot and articulate when to use long context versus RAG — skills that signal production experience.
What Interviewers Expect
Context window questions test whether you understand the cost and performance implications of LLM architecture decisions. Senior candidates should be able to calculate costs on the spot and articulate when to use long context vs RAG.
Strong vs Weak Answer Patterns
Q: “Your LLM application processes 100-page legal documents. How would you design the context management?”
Weak: “I would use a model with a large context window and put the whole document in.”
Strong: “First, I would calculate the token count — 100 pages is roughly 50,000 tokens. With GPT-4o at $2.50/M input, that is $0.125 per request just for input. If we expect 1,000 requests per day, that is $3,750/month on input alone. I would implement a hybrid approach: use an embedding model to chunk and index the document, retrieve the 10 most relevant chunks per query using RAG, and pass those into a 16K context window. This drops the average input to about 8,000 tokens — a 6x cost reduction. For questions that require cross-section reasoning, like ‘does clause 4.2 conflict with clause 9.1,’ I would fall back to the full context approach but use Gemini 1.5 Flash at $0.075/M tokens to keep costs manageable.”
Q: “How do you handle the ‘lost in the middle’ problem?”
Weak: “We put important information at the beginning.”
Strong: “The Liu et al. research showed that models perform 20-30% worse on information placed in the middle of long contexts. I handle this three ways. First, structural ordering — system instructions and critical context go at the beginning, the user’s current query goes at the end, and supporting context fills the middle. Second, I use a relevance-scored retrieval system that ranks chunks by importance and places the highest-scored chunks at position boundaries. Third, for critical applications, I implement a verification step where the model is asked to explicitly cite which sections it used — if it cannot cite middle-positioned content, I know I need to restructure.”
Common Interview Questions
- Calculate the monthly cost of running a customer support chatbot with 50,000 daily conversations at an average of 8 messages each
- How would you reduce token costs by 80% without sacrificing response quality?
- Explain the quadratic attention cost and how it affects pricing
- Compare the cost-effectiveness of GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro for a document analysis use case
- Design a conversation history management system that stays within a 16K token budget
- When would you choose long context over RAG? Give three specific scenarios with cost justifications
9. Context Windows in Production
Three patterns — graceful degradation, tiered model routing, and context caching — can reduce production token costs by 40–90% without changing application behavior for end users.
Production Token Budgeting Patterns
Pattern 1: Fixed Budget with Graceful Degradation
Set a hard token budget per request. When content exceeds it, degrade gracefully:
- First, drop the oldest conversation history messages
- Then, reduce RAG chunks from 10 to 5
- Then, summarize remaining RAG chunks before injection
- Finally, truncate to fit (last resort — log a warning)
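A sketch of the first two steps of that cascade, using a rough 4-characters-per-token heuristic in place of a real tokenizer; the summarize and truncate steps are left as comments, and all names here are illustrative:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per English token
    return max(1, len(text) // 4)


def degrade_to_budget(
    history: list[str],
    rag_chunks: list[str],
    budget_tokens: int,
) -> tuple[list[str], list[str]]:
    """Apply the degradation cascade: drop oldest history first,
    then trim RAG chunks down toward 5, truncating only as a last resort."""
    def total() -> int:
        return sum(approx_tokens(t) for t in history + rag_chunks)

    # Step 1: drop the oldest conversation history messages
    while history and total() > budget_tokens:
        history = history[1:]
    # Step 2: reduce RAG chunks (lowest-ranked last, so trim from the end)
    while len(rag_chunks) > 5 and total() > budget_tokens:
        rag_chunks = rag_chunks[:-1]
    # Steps 3 and 4 (summarize remaining chunks, then hard-truncate)
    # would go here; log a warning if truncation is ever reached.
    if total() > budget_tokens:
        print("WARNING: context still over budget, truncation required")
    return history, rag_chunks


history = [f"message {i} " * 50 for i in range(10)]
rag = ["chunk text " * 50 for _ in range(10)]
kept_history, kept_rag = degrade_to_budget(history, rag, budget_tokens=1_200)
```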
Pattern 2: Tiered Models by Context Need
Route requests to different models based on context requirements:
- <4K tokens: GPT-4o mini ($0.15/M) — fast and cheap
- 4K-32K tokens: GPT-4o ($2.50/M) — best quality-to-price
- 32K-200K tokens: Claude 3.5 Sonnet ($3.00/M) — strong long-context performance
- 200K+ tokens: Gemini 1.5 Pro ($1.25/M) — only option at this scale
This alone can cut costs 40-60% compared to routing everything through one model.
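The tiers above translate directly into a dispatch function. The model identifier strings below are illustrative placeholders; check your provider's current model names and the exact cutoffs before using them:

```python
def pick_model(input_tokens: int) -> str:
    """Route by context size: the cheapest model that fits the request."""
    if input_tokens < 4_000:
        return "gpt-4o-mini"
    if input_tokens < 32_000:
        return "gpt-4o"
    if input_tokens < 200_000:
        return "claude-3-5-sonnet-latest"
    return "gemini-1.5-pro"

print(pick_model(2_500))    # gpt-4o-mini
print(pick_model(150_000))  # claude-3-5-sonnet-latest
```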
Pattern 3: Context Caching for Repeated Prefixes
Anthropic and Google both offer prompt caching. If your system prompt + few-shot examples are the same across requests (they usually are), cache them:
- Anthropic cache pricing: 90% discount on cached tokens. A 3,000-token system prompt reused 10,000 times/day costs ~$90/day at $3.00/M input pricing; caching saves roughly $81/day (~$2,400/month)
- Google cache pricing: Similar discount for prefixes. Gemini 1.5 Pro with cached prefix drops effective input cost to ~$0.31/M tokens
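The savings arithmetic is worth making explicit. This sketch assumes a flat 90% discount on every cached input token and ignores the one-time cache-write premium that real cache pricing adds:

```python
def daily_cache_savings(
    prefix_tokens: int,
    calls_per_day: int,
    input_price_per_m: float,
    cache_discount: float = 0.90,
) -> float:
    """USD saved per day by caching a repeated prompt prefix."""
    full_cost = prefix_tokens * calls_per_day / 1_000_000 * input_price_per_m
    return full_cost * cache_discount

# 3,000-token system prompt, 10,000 calls/day, Claude 3.5 Sonnet input price
print(round(daily_cache_savings(3_000, 10_000, 3.00), 2))  # 81.0
```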
Cost Control Checklist
Before deploying any LLM-powered feature to production, verify:
- Token budget defined for each component (system, user, RAG, output)
- Maximum output tokens capped (max_tokens parameter set)
- Conversation history has a sliding window with token limit
- RAG chunks are ranked by relevance and trimmed to fit budget
- Context utilization monitored with alerts at 80% and 95%
- Per-request cost logged and aggregated in billing dashboard
- Context caching enabled for static prefixes (system prompt, examples)
- Fallback model configured for overflow scenarios
- Monthly cost projection reviewed before launch
- Rate limiting enforced to prevent cost spikes from abuse
Scaling Cost Projections
| Growth Scenario | 1K req/day | 10K req/day | 100K req/day | 1M req/day |
|---|---|---|---|---|
| Small context (4K in, 500 out) | $450/mo | $4,500/mo | $45,000/mo | $450,000/mo |
| Medium context (16K in, 2K out) | $1,800/mo | $18,000/mo | $180,000/mo | $1,800,000/mo |
| Large context (64K in, 4K out) | $6,000/mo | $60,000/mo | $600,000/mo | $6,000,000/mo |
| Full context (128K in, 8K out) | $12,000/mo | $120,000/mo | $1,200,000/mo | $12,000,000/mo |
Based on GPT-4o pricing ($2.50/M input, $10/M output). Multiply by 0.06 for GPT-4o mini.
10. Summary and Key Takeaways
Use these reference tables and checklists to make context window decisions quickly: which provider, which strategy, and what to verify before shipping.
The Decision in 30 Seconds
| Question | Answer |
|---|---|
| How much does filling a 128K window cost? | $0.32 (GPT-4o), $0.60 (Claude 3.5 Sonnet), $0.075 (Gemini Flash) |
| Should I use long context or RAG? | RAG for >10K req/day or tight budgets. Long context for deep analysis tasks |
| What is the biggest cost mistake? | Not setting max_tokens on output. A runaway response can 10x your cost |
| How do I cut costs without quality loss? | Cache static prefixes, tier models by context need, use RAG to shrink input |
| Does position in context matter? | Yes. Start and end positions get 85-95% accuracy. Middle drops to 55-70% |
Official Documentation
- OpenAI Tokenizer — Interactive token counter for GPT models
- Anthropic Token Counting — Claude tokenizer and context window docs
- Google AI Context Caching — Gemini context caching guide
- tiktoken (Python) — Fast BPE tokenizer for OpenAI models
- Lost in the Middle (Liu et al.) — Research on positional bias in long contexts
Related
- RAG Architecture — Build retrieval pipelines that keep context lean and costs low
- Prompt Engineering — Write system prompts that maximize quality per token
- LLM Evaluation — Measure quality across different context sizes and strategies
- Fine-Tuning vs RAG — When to bake knowledge into the model vs retrieving at runtime
- GenAI System Design — Full production architecture including token budget management
Last updated: March 2026. Token pricing changes frequently; verify current rates against provider documentation before making architecture decisions.
Frequently Asked Questions
How much does it cost to fill a 128K context window?
At GPT-4o pricing of $2.50 per million input tokens, filling a 128K context window costs $0.32 per request. At 10,000 requests per day, that is $3,200 daily or $96,000 per month on input tokens alone. Token budgeting — spending each token where it matters most — is what separates production-ready GenAI engineers from tutorial-followers.
What are the largest context windows available in 2026?
Gemini 1.5 Pro offers the largest at 2 million tokens, capable of processing entire codebases or book-length documents. Claude 3.5 Sonnet provides 200K tokens with the best quality-per-token ratio for long context. GPT-4o supports 128K tokens with the fastest price drops. Llama 3.1 matches 128K tokens as an open-weight model you can self-host to eliminate per-token costs.
How do you reduce LLM context costs without losing quality?
Use RAG to extract only relevant sections instead of stuffing entire documents. Apply context compression techniques to condense information. Use context caching from Anthropic and Google to cache repeated prefixes at 90% discount. Allocate token budgets across system prompt, user input, RAG context, and output. One fintech startup cut costs from $30,600 to $4,100 monthly — an 87% reduction — with no measurable quality loss.
Why do context windows have limits?
Context windows are limited because of how transformer attention works. Each token attends to every other token in the window, so attention computation scales quadratically — doubling context from 64K to 128K quadruples the attention cost. This is why larger context windows cost more per token and why techniques like sparse attention are being researched to handle 10M+ tokens.
What is the token budgeting framework for LLM requests?
Every LLM request has four token consumers: system prompt, user input, RAG context, and reserved output. For a 128K-token window, a practical budget allocates roughly 2,000 tokens for the system prompt, 3,000 for few-shot examples, 40,000 for RAG context, 15,000 for user input, 4,000 for safety buffer, and 64,000 (50%) reserved for output. The most common mistake is not reserving enough output tokens.
What is the lost in the middle problem in LLMs?
Research from Liu et al. (2023) showed that LLMs perform best on information at the beginning and end of the context window but degrade on information in the middle. Retrieval accuracy is 85-95% for the first 10% of context, drops to 55-70% in the middle, and recovers to 80-90% in the last 10%. Put your most important information at the start and the user query at the end.
When should I use long context windows versus RAG?
Use RAG for high-volume applications with more than 10,000 requests per day or tight budgets under $2K per month. Use long context for deep analysis tasks like legal review or code understanding where cross-document reasoning matters. The hybrid approach — retrieving relevant chunks via RAG and passing them into a 16K-32K context — is what most production systems use.
How does conversation history affect context window costs?
Every message in the conversation history consumes context window tokens. A 50-message conversation can use 10K-20K tokens before adding any RAG context. Manage this with a sliding window that keeps only the last N turns, summarization of older turns into compressed summaries, or selective retention that keeps high-information turns and drops filler and acknowledgements.
What is context caching and how much does it save?
Context caching from Anthropic and Google lets you cache repeated prefixes like system prompts at roughly 90% discount on token costs. A 3,000-token system prompt reused 10,000 times per day saves roughly $80 per day, about $2,400 per month, at Claude 3.5 Sonnet input pricing. Google offers similar discounts for Gemini 1.5 Pro, dropping effective input cost to approximately $0.31 per million tokens for cached prefixes.
How do I monitor context window usage in production?
Track context utilization (input tokens divided by max context), RAG chunks used versus dropped, per-request cost, and latency. Set alerts at 80% and 95% context utilization. Log per-request costs and aggregate them in a billing dashboard. Monitor chunks dropped per request to detect when your RAG retrieval returns more content than the budget allows.