LLM Cost Optimization — Token Management, Caching & Model Routing (2026)
GenAI engineers who understand LLM fundamentals and can build working RAG systems often hit a wall in production: the bill. A prototype that costs pennies a day can cost thousands per month at scale. LLM cost optimization is the discipline of making the economics work — maintaining quality while reducing cost by 60-90%. This page gives you the mental models, architecture patterns, and code to do it.
Who this is for:
- Backend and GenAI engineers shipping LLM-powered features to production who have received their first large API bill and need to fix it.
- Engineering managers evaluating whether a GenAI feature is economically viable at scale.
- Senior engineers preparing for interviews — cost optimization is a required topic for GenAI system design rounds at staff level and above.
- Architects designing multi-model pipelines who need to reason about cost vs. capability tradeoffs at each layer.
The Cost Problem — Why LLM Bills Spiral
LLM APIs price by token — both the input tokens you send and the output tokens the model generates. This creates a cost structure that surprises most engineers coming from traditional software.
In a conventional web service, adding a user or doubling traffic costs you compute and bandwidth. Both are cheap and predictable. In an LLM-powered service, the cost per request is determined by:
- Which model you call — GPT-4o costs roughly 17x more per token than GPT-4o-mini. Claude 3.5 Sonnet costs roughly 4x more than Claude 3.5 Haiku.
- How many input tokens you send — your system prompt, retrieved context, conversation history, and the user’s query all count.
- How many output tokens the model generates — longer, more detailed responses cost proportionally more.
- How often you call the API — every retry, every re-query, every chain step is a separate charge.
The Compounding Problem
These factors compound. A RAG system that sends a 2,000-token system prompt, retrieves 3,000 tokens of context, and processes a 200-token user query sends 5,200 input tokens; an 800-token response brings the total to 6,000 tokens per call. At GPT-4o pricing (roughly $0.0025/1K input + $0.01/1K output), that is roughly $0.021 per request. At 10,000 requests per day, that is $210/day — roughly $6,300/month — from a single feature.
Now imagine that feature makes three LLM calls per user interaction (retrieval, generation, summarization): roughly $19,000/month. Now add a multi-agent workflow where agents call each other. Costs can reach six figures monthly for moderately popular applications.
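This arithmetic is worth wrapping in a small projection helper so the bill is visible before it arrives. A minimal sketch; the rate constants are illustrative and must be replaced with current provider pricing:

```python
# Illustrative per-1K-token rates for GPT-4o (verify against current pricing)
INPUT_RATE_PER_1K = 0.0025
OUTPUT_RATE_PER_1K = 0.010

def estimate_monthly_cost(
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    calls_per_interaction: int,
    interactions_per_day: int,
) -> float:
    """Project monthly API spend from per-call token counts (30-day month)."""
    per_call = (
        input_tokens_per_call / 1000 * INPUT_RATE_PER_1K
        + output_tokens_per_call / 1000 * OUTPUT_RATE_PER_1K
    )
    return per_call * calls_per_interaction * interactions_per_day * 30

# A RAG call with 5,200 input + 800 output tokens, once per interaction, 10K/day
print(round(estimate_monthly_cost(5200, 800, 1, 10_000), 2))  # → 6300.0
```

Re-running the projection with three calls per interaction, or with a different model's rates, makes the compounding effect concrete before any code ships.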
Understanding context windows is especially important here: larger context windows do not just change what the model can see — they change what you pay. Stuffing a 100K-token context through GPT-4o on every request will bankrupt a product quickly.
Token Economics at a Glance
| Model | Input cost/1M tokens | Output cost/1M tokens | Relative cost |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Baseline |
| GPT-4o-mini | $0.15 | $0.60 | ~17x cheaper |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Similar to GPT-4o |
| Claude 3.5 Haiku | $0.80 | $4.00 | ~4x cheaper |
| Llama 3.1 70B (self-hosted) | $0.00 | $0.00 | Infra cost only |
Prices are approximate as of early 2026. Verify against official provider pricing pages before committing to an architecture.
The implication is direct: routing 70% of your requests to a model that is 17x cheaper — while preserving quality for the hard 30% — cuts total API spend by roughly 65% without users noticing.
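A quick sanity check on that claim, treating the expensive model's per-request cost as 1.0 and the cheap model's as 1/17th of it:

```python
def blended_cost(cheap_fraction: float, cheap_ratio: float = 1 / 17) -> float:
    """Blended spend relative to sending every request to the expensive model."""
    return (1 - cheap_fraction) * 1.0 + cheap_fraction * cheap_ratio

# Routing 70% of traffic to the 17x-cheaper model
savings = 1 - blended_cost(0.70)
print(f"{savings:.0%}")  # → 66%
```

The residual 30% of hard requests dominates the remaining bill, which is why the next lever after routing is shrinking the tokens those hard requests consume.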
Model Routing — The Highest-Leverage Optimization
Model routing is the practice of classifying each incoming request by complexity, then directing it to the cheapest model capable of handling it. It is the single highest-leverage cost reduction technique because it addresses the root cause: paying expensive-model prices for tasks that do not require expensive-model capability.
The Tiered Model Pattern
Most production systems use three tiers:
- Tier 1 (cheap): Simple classification, keyword extraction, format conversion, yes/no decisions. Models: GPT-4o-mini, Claude Haiku, Gemini Flash. Cost: $0.001-0.005 per call.
- Tier 2 (mid): Multi-step reasoning, summarization, structured data extraction, code explanation. Models: GPT-4o, Claude Sonnet, Gemini Pro. Cost: $0.01-0.05 per call.
- Tier 3 (expensive): Complex code generation, nuanced creative work, long-form analysis, tasks requiring world knowledge. Models: o1/o3, Claude 3.7 Sonnet, Gemini Ultra. Cost: $0.05-0.50 per call.
The goal is to push as many requests as possible to Tier 1 while routing only genuinely hard tasks to Tier 3.
Building a Complexity Classifier
A complexity classifier is itself an LLM call — but a very cheap one. You use a fast, inexpensive model to categorize the incoming request, then route based on that categorization.
```python
# Requires: openai>=1.30.0
from enum import Enum

from openai import OpenAI

client = OpenAI()

class ComplexityTier(str, Enum):
    SIMPLE = "simple"    # Tier 1: cheap model
    MEDIUM = "medium"    # Tier 2: mid model
    COMPLEX = "complex"  # Tier 3: expensive model

CLASSIFIER_PROMPT = """Classify the complexity of this user request. Reply with exactly one word:
- "simple" — format conversion, yes/no answer, keyword extraction, basic lookup
- "medium" — summarization, multi-step reasoning, code explanation, structured extraction
- "complex" — original code generation, nuanced analysis, multi-document synthesis, mathematical reasoning

User request: {query}"""

MODEL_MAP = {
    ComplexityTier.SIMPLE: "gpt-4o-mini",
    ComplexityTier.MEDIUM: "gpt-4o",
    ComplexityTier.COMPLEX: "o1-mini",
}

def classify_and_route(query: str, system_prompt: str, context: str = "") -> str:
    # Step 1: classify with the cheapest model
    classification_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(query=query)}],
        max_tokens=10,
        temperature=0,
    )
    tier_str = classification_response.choices[0].message.content.strip().lower()

    # Default to medium if the classifier output is unexpected
    try:
        tier = ComplexityTier(tier_str)
    except ValueError:
        tier = ComplexityTier.MEDIUM

    target_model = MODEL_MAP[tier]

    # Step 2: route to the appropriate model
    messages = [{"role": "system", "content": system_prompt}]
    if context:
        messages.append({"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"})
    else:
        messages.append({"role": "user", "content": query})

    response = client.chat.completions.create(
        model=target_model,
        messages=messages,
    )

    return response.choices[0].message.content
```

Rule-Based Routing (Zero Classifier Cost)
For predictable workloads, skip the classifier entirely and route by signal:
```python
def rule_based_router(query: str, context_length: int, has_code: bool) -> str:
    # Long context always needs a capable model
    if context_length > 8000:
        return "gpt-4o"

    # Code generation is complex
    if has_code and any(kw in query.lower() for kw in ["generate", "write", "implement", "create"]):
        return "gpt-4o"

    # Short queries with no code are simple
    if len(query.split()) < 20 and not has_code:
        return "gpt-4o-mini"

    return "gpt-4o"  # default to mid tier
```

Rule-based routing has zero latency overhead and zero additional cost. It works well when your request types are predictable — customer support bots, document processing pipelines, structured data extraction.
Cost Optimization Pipeline
The diagram below shows how a production system routes requests through optimization layers before an expensive model call ever happens.
📊 Visual Explanation
LLM Cost Optimization Pipeline
Each layer catches a request before it reaches a more expensive tier. Most requests should resolve before reaching Tier 2 or Tier 3 models.
This pipeline architecture means expensive model calls are the last resort, not the default. A well-tuned pipeline sees 30-50% of requests resolved by cache and 40-60% of remaining requests handled by Tier 1 models.
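In code, the layered pipeline reduces to a chain of early returns. This is a hypothetical sketch: the cache_get, route, call_model, and cache_set callables are stand-ins for the cache, router, and client components described elsewhere on this page.

```python
from typing import Callable, Optional

def handle_request(
    query: str,
    cache_get: Callable[[str], Optional[str]],
    route: Callable[[str], str],            # returns a model name for the query
    call_model: Callable[[str, str], str],  # (model, query) -> response
    cache_set: Callable[[str, str], None],
) -> str:
    """Layered pipeline: cache first, then route to the cheapest capable model."""
    # Layer 1: cache — a hit costs nothing and returns immediately
    cached = cache_get(query)
    if cached is not None:
        return cached

    # Layer 2: routing — most surviving requests should land on a Tier 1 model
    model = route(query)

    # Layer 3: the model call is the last resort, not the default
    response = call_model(model, query)
    cache_set(query, response)
    return response
```

Wiring a plain dict in as the cache is enough to exercise the flow in tests; in production each callable is backed by the components built in the following sections.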
Prompt Compression and Token Management
Even after routing to the right model, you can significantly reduce per-call costs by reducing the number of tokens you send. Prompt compression is the practice of stripping unnecessary tokens from your inputs without degrading output quality.
System Prompt Compression
System prompts are a common source of token waste. Engineers write system prompts in natural language for readability, but LLMs do not need grammatically correct prose. They respond equally well to compressed instruction formats.
Before (127 tokens):
```text
You are a helpful customer support assistant for Acme Corp. Your job is to help customers
with their questions about our products and services. You should always be polite and
professional. If you don't know the answer to a question, you should say so and offer
to connect the customer with a human agent. Never make up information about products
or pricing. Always respond in the same language the customer uses.
```

After (42 tokens):

```text
Role: Acme Corp support agent.
Rules: Polite, professional. Match user language.
If unknown: say so, offer human escalation.
Never fabricate product/pricing info.
```

Same instructions. 67% fewer tokens. A system prompt running at 127 tokens across 10,000 daily requests costs 1.27M input tokens per day. At 42 tokens, that is 420K. The savings at GPT-4o pricing: roughly $2.13/day — about $775/year from compressing a single system prompt.
Dynamic Context Pruning
In RAG systems, retrieved context is frequently the largest token consumer. Standard RAG retrieves top-k chunks regardless of actual relevance. Dynamic pruning removes chunks below a relevance threshold before they enter the prompt.
```python
# Requires: python>=3.9
from typing import List, Tuple

def prune_retrieved_chunks(
    chunks: List[Tuple[str, float]],  # (text, similarity_score)
    min_similarity: float = 0.75,
    max_tokens: int = 3000,
    approx_tokens_per_word: float = 1.3,
) -> str:
    """Filter chunks by relevance and token budget, return joined context."""
    filtered = [text for text, score in chunks if score >= min_similarity]

    # Build context within the token budget
    context_parts = []
    token_count = 0

    for chunk in filtered:
        chunk_tokens = int(len(chunk.split()) * approx_tokens_per_word)
        if token_count + chunk_tokens > max_tokens:
            break
        context_parts.append(chunk)
        token_count += chunk_tokens

    return "\n\n".join(context_parts)
```

Conversation History Truncation
Agents and chatbots accumulate conversation history that grows without bound. Strategies to manage it:
- Sliding window: Keep only the last N turns. Simple, predictable token cost.
- Summarization compression: Periodically summarize older turns into a compact summary, replace them with the summary. Preserves context without accumulating raw tokens.
- Importance scoring: Score each message by relevance to the current query, drop low-scoring old messages.
```python
def truncate_history_sliding_window(
    messages: list,
    max_turns: int = 10,
    always_keep_system: bool = True,
) -> list:
    """Keep system message + last N user/assistant turn pairs."""
    system_messages = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]

    # Keep the last max_turns * 2 messages (each turn = user + assistant)
    truncated_conversation = conversation[-(max_turns * 2):]

    if always_keep_system:
        return system_messages + truncated_conversation
    return truncated_conversation
```

Caching Strategies
Caching is the highest-efficiency optimization available — a cache hit costs nothing and returns instantly. Three distinct caching patterns apply to LLM workloads.
Exact Match Cache
For deterministic, repeated queries — API documentation lookups, FAQ answers, status checks — exact string matching is sufficient. Store the query string as a hash key, the response as the value.
```python
# Requires: redis>=5.0.0
import hashlib
import json

import redis

class ExactMatchCache:
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 3600):
        self.cache = redis_client
        self.ttl = ttl_seconds

    def _make_key(self, model: str, messages: list, temperature: float) -> str:
        payload = json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return f"llm:exact:{hashlib.sha256(payload.encode()).hexdigest()}"

    def get(self, model: str, messages: list, temperature: float = 0.0) -> str | None:
        key = self._make_key(model, messages, temperature)
        cached = self.cache.get(key)
        return cached.decode() if cached else None

    def set(self, model: str, messages: list, response: str, temperature: float = 0.0):
        key = self._make_key(model, messages, temperature)
        self.cache.setex(key, self.ttl, response)
```

Exact match works best with temperature=0 calls. Caching non-deterministic outputs (temperature > 0) can return stale responses — use it only when the response is meant to be stable.
Semantic Cache
Semantic caching matches queries by meaning, not exact text. A user asking “What is your return policy?” and another asking “Can I return something I bought?” should get the same cached answer.
The pattern uses embedding similarity:
```python
# Requires: openai>=1.30.0, numpy>=1.26.0
import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92):
        self.entries: list[dict] = []  # In production: use a vector DB
        self.threshold = similarity_threshold

    def _embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a_arr, b_arr = np.array(a), np.array(b)
        return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

    def get(self, query: str) -> str | None:
        query_embedding = self._embed(query)
        for entry in self.entries:
            similarity = self._cosine_similarity(query_embedding, entry["embedding"])
            if similarity >= self.threshold:
                return entry["response"]
        return None

    def set(self, query: str, response: str):
        embedding = self._embed(query)
        self.entries.append({"query": query, "embedding": embedding, "response": response})
```

In production, replace the in-memory list with a vector database (Pinecone, Weaviate, pgvector) for scalable similarity search. GPTCache is a popular open-source library that implements this pattern with multiple backend options.
Threshold selection matters: A similarity threshold of 0.95+ is conservative (few false positives, lower hit rate). A threshold of 0.85 is aggressive (higher hit rate, occasional incorrect cache hits). Test on your actual query distribution before choosing.
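One way to choose the threshold empirically is to sweep candidate values over a small hand-labeled set of (similarity, same-intent) pairs drawn from your own traffic. A minimal sketch; the pair data below is invented for illustration:

```python
def sweep_thresholds(
    scored_pairs: list[tuple[float, bool]],  # (cosine_similarity, is_same_intent)
    candidates: list[float],
) -> dict[float, tuple[float, float]]:
    """For each candidate threshold, return (hit_rate, false_positive_rate)."""
    results = {}
    for t in candidates:
        hits = [(s, same) for s, same in scored_pairs if s >= t]
        hit_rate = len(hits) / len(scored_pairs)
        false_positives = sum(1 for _, same in hits if not same)
        fp_rate = false_positives / len(hits) if hits else 0.0
        results[t] = (hit_rate, fp_rate)
    return results

# Hand-labeled sample: similarity score vs. whether the pair truly shares intent
pairs = [(0.97, True), (0.93, True), (0.90, False), (0.88, True), (0.80, False)]
for t, (hit, fp) in sweep_thresholds(pairs, [0.85, 0.92, 0.95]).items():
    print(t, round(hit, 2), round(fp, 2))
```

Pick the highest threshold whose false-positive rate is acceptable for your product; wrong cached answers usually cost more trust than cache misses cost money.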
Prompt Prefix Caching (Provider-Native)
Several providers now offer built-in prompt caching that caches the KV (key-value) computation for repeated prompt prefixes:
- Anthropic Prompt Caching: Cache breakpoints in system prompts. Cached tokens cost 90% less. Cache lifetime: 5 minutes, extendable.
- OpenAI Prompt Caching: Automatic for prompts over 1,024 tokens. Cached input tokens cost 50% less.
On OpenAI this is the easiest caching win, requiring zero application code changes: the discount is applied automatically when your prompts share a long common prefix of at least 1,024 tokens. Anthropic requires you to mark cache breakpoints explicitly.
```python
# Requires: anthropic>=0.40.0
# Anthropic prompt caching — explicit cache breakpoint
import anthropic

anthropic_client = anthropic.Anthropic()

response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",
        },
        {
            "type": "text",
            "text": "<entire_product_catalog_here>",  # Large static context
            "cache_control": {"type": "ephemeral"},   # Cache this prefix
        },
    ],
    messages=[{"role": "user", "content": "What are your shipping options?"}],
)
```

The cache_control breakpoint tells Anthropic to cache everything up to that point. Subsequent requests sharing the same prefix pay 90% less for those tokens.
Batch Processing
Not all LLM calls are time-sensitive. Background summarization, report generation, content classification, and data enrichment jobs can wait minutes or hours. Batch APIs exist specifically for these workloads — at a 50% price reduction.
When to Use Batch APIs
| Pattern | Real-time API | Batch API |
|---|---|---|
| User waiting for response | Required | Not suitable |
| Nightly report generation | Wasteful | Ideal |
| Classifying 100K documents | Slow + expensive | Fast + 50% cheaper |
| Email content generation | Depends on deadline | Usually suitable |
| RAG index enrichment | Not needed | Ideal |
OpenAI Batch API Pattern
Section titled “OpenAI Batch API Pattern”# Requires: openai>=1.30.0import jsonfrom openai import OpenAIimport time
client = OpenAI()
def submit_batch_job(requests: list[dict]) -> str: """Submit a batch of LLM requests. Returns batch_id for status polling.""" # Create JSONL file jsonl_content = "\n".join( json.dumps({ "custom_id": f"request-{i}", "method": "POST", "url": "/v1/chat/completions", "body": { "model": "gpt-4o-mini", "messages": req["messages"], "max_tokens": req.get("max_tokens", 500), } }) for i, req in enumerate(requests) )
# Upload input file input_file = client.files.create( file=("batch_input.jsonl", jsonl_content.encode(), "application/jsonl"), purpose="batch", )
# Create batch batch = client.batches.create( input_file_id=input_file.id, endpoint="/v1/chat/completions", completion_window="24h", )
return batch.id
def retrieve_batch_results(batch_id: str) -> list[dict]: """Poll for completion and return results.""" while True: batch = client.batches.retrieve(batch_id) if batch.status == "completed": break elif batch.status == "failed": raise RuntimeError(f"Batch failed: {batch.errors}") time.sleep(60) # Poll every minute
output_file = client.files.content(batch.output_file_id) results = [] for line in output_file.text.strip().split("\n"): result = json.loads(line) results.append({ "id": result["custom_id"], "content": result["response"]["body"]["choices"][0]["message"]["content"], })
return resultsQueue Architecture for Mixed Workloads
Production systems often need both real-time and batch paths. A queue-based architecture handles both:
```text
User Request
  │
  ├─ is_urgent? → Real-time API (full price)
  │
  └─ not urgent → Job Queue → Batch Processor → Result Store
                                   │
                                   (runs every 4h, 50% cost)
```

Use a priority queue (Redis, SQS) where urgent jobs bypass the batch accumulator. Non-urgent jobs accumulate until a threshold (time elapsed or batch size) triggers a batch submission.
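The accumulator side of that architecture can be sketched in a few lines. This is a hypothetical in-process version: the submit callable (for example, the submit_batch_job function defined earlier) and the flush thresholds are assumptions to adapt.

```python
import time
from typing import Callable, Optional

class BatchAccumulator:
    """Collect non-urgent jobs; flush when a size or age threshold is hit."""

    def __init__(self, submit: Callable[[list[dict]], str],
                 max_batch_size: int = 500, max_wait_seconds: float = 4 * 3600):
        self.submit = submit                  # called with the accumulated jobs
        self.max_batch_size = max_batch_size
        self.max_wait_seconds = max_wait_seconds
        self.pending: list[dict] = []
        self.oldest_enqueued_at: Optional[float] = None

    def enqueue(self, job: dict) -> Optional[str]:
        """Add a job; return a batch id if this enqueue triggered a flush."""
        if self.oldest_enqueued_at is None:
            self.oldest_enqueued_at = time.monotonic()
        self.pending.append(job)

        too_big = len(self.pending) >= self.max_batch_size
        too_old = time.monotonic() - self.oldest_enqueued_at >= self.max_wait_seconds
        if too_big or too_old:
            return self.flush()
        return None

    def flush(self) -> Optional[str]:
        if not self.pending:
            return None
        jobs, self.pending = self.pending, []
        self.oldest_enqueued_at = None
        return self.submit(jobs)
```

In a real deployment the pending list would live in Redis or SQS rather than process memory, so a restart does not lose queued jobs.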
Cost Monitoring and Budgeting
Optimization strategies only work if you can measure their effect. Cost monitoring closes the loop — without it, you are flying blind and cannot know whether your optimizations are working or when costs are trending toward overage.
Per-Request Cost Tracking
Every LLM call should log token counts and estimated cost:
```python
# Requires: python>=3.10
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LLMCallMetrics:
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float
    latency_ms: float
    user_id: str | None
    feature: str
    timestamp: datetime

# Pricing table in $ per 1K tokens (update when providers change rates)
PRICING = {
    "gpt-4o": {"input": 0.0025, "output": 0.010, "cached_input": 0.00125},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006, "cached_input": 0.000075},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015, "cached_input": 0.0003},
    "claude-3-5-haiku-20241022": {"input": 0.0008, "output": 0.004, "cached_input": 0.00008},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    pricing = PRICING.get(model, PRICING["gpt-4o"])  # default to a conservative estimate
    regular_input = input_tokens - cached_tokens
    cost = (
        regular_input / 1000 * pricing["input"]
        + cached_tokens / 1000 * pricing.get("cached_input", pricing["input"] * 0.1)
        + output_tokens / 1000 * pricing["output"]
    )
    return cost
```

Budget Alerts and Per-User Limits
For multi-tenant applications, set per-user token budgets to prevent runaway usage:
```python
# Requires: redis>=5.0.0
from datetime import date

import redis

class TokenBudgetManager:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_and_consume(
        self,
        user_id: str,
        estimated_tokens: int,
        daily_limit: int = 100_000,
        monthly_limit: int = 2_000_000,
    ) -> bool:
        """Return True if under budget, False if limit exceeded."""
        day_key = f"budget:{user_id}:day:{date.today().isoformat()}"
        month_key = f"budget:{user_id}:month:{date.today().strftime('%Y-%m')}"

        daily_used = int(self.redis.get(day_key) or 0)
        monthly_used = int(self.redis.get(month_key) or 0)

        if daily_used + estimated_tokens > daily_limit:
            return False
        if monthly_used + estimated_tokens > monthly_limit:
            return False

        # Consume budget
        pipe = self.redis.pipeline()
        pipe.incrby(day_key, estimated_tokens)
        pipe.expire(day_key, 86400)  # 1 day TTL
        pipe.incrby(month_key, estimated_tokens)
        pipe.expire(month_key, 2_592_000)  # 30 day TTL
        pipe.execute()

        return True
```

Key Metrics to Track
Align your monitoring with LLMOps evaluation and production observability practices:
- Cost per request — broken down by model, feature, and user segment
- Cache hit rate — semantic cache hit rate should be a KPI you actively improve
- Model distribution — percentage of requests routed to each tier
- Tokens per request — input and output separately; rising averages signal prompt drift
- Cost per active user per day — the metric that determines unit economics
- Budget utilization rate — per-user and per-feature, alerting at 80% threshold
A weekly cost review comparing these metrics against a baseline catches the regressions that erode optimizations over time — prompt edits that silently lengthen system prompts, new features that default to expensive models, and growing conversation history depths.
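The aggregation behind such a review can be sketched simply, assuming each call is logged as a record using the field names of the LLMCallMetrics dataclass above:

```python
from collections import defaultdict

def cost_per_feature(metrics: list[dict]) -> dict[str, float]:
    """Average cost per request, grouped by feature — one weekly-review view."""
    totals: dict[str, list[float]] = defaultdict(list)
    for m in metrics:
        totals[m["feature"]].append(m["cost_usd"])
    return {feature: sum(costs) / len(costs) for feature, costs in totals.items()}

# Illustrative logged calls
calls = [
    {"feature": "chat", "cost_usd": 0.020},
    {"feature": "chat", "cost_usd": 0.030},
    {"feature": "summarize", "cost_usd": 0.004},
]
print(cost_per_feature(calls))
```

The same grouping applied to tokens-per-request and cache hit rate, diffed week over week, is what surfaces a silently lengthened prompt before it shows up on the invoice.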
Interview Preparation
LLM cost optimization is a required topic for senior and staff GenAI engineering interviews. Interviewers want to see that you understand the economics of production systems, not just how to build demos. These questions appear frequently in GenAI system design rounds.
Q: A customer support chatbot is costing $50,000/month in API fees. The team cannot cut features or reduce users. What is your optimization plan?
A structured answer covers four parallel tracks. First, instrument the system to measure where tokens are going — system prompt size, retrieved context size, conversation history length, output token counts — broken down by request type. Second, implement model routing: classify incoming requests and route simple queries (FAQ lookups, status checks, yes/no decisions) to a Tier 1 model like GPT-4o-mini. Expect 40-60% of requests to qualify. Third, add semantic caching using an embedding similarity search against previous successful responses. FAQ-style questions typically show 30-50% cache hit rates in support contexts. Fourth, compress the system prompt and prune retrieved context to token budgets. Together, these typically reduce spend to $10,000-$20,000/month — a 60-80% reduction with no user-facing quality change.
Q: How do you decide which model to use for a given request in a production system?
The decision framework has three components. A rule-based pre-filter routes obviously simple or obviously complex requests without needing a classifier. A classifier (itself a cheap model call) handles the ambiguous middle. An escape hatch lets users or features explicitly request a higher-tier model when needed. The classifier uses signals like query length, presence of code, reasoning complexity keywords, and whether structured multi-step output is required. The routing decision is logged alongside the response, enabling offline evaluation of routing accuracy — if cheap-model responses are rated poorly by users, the classifier thresholds need adjustment.
Q: What is the difference between exact match caching and semantic caching, and when would you use each?
Exact match caching stores responses keyed by a hash of the full request (model + messages + temperature). It has zero false positives — it only returns a cached response when the request is identical. It works well for deterministic queries: API documentation lookups, fixed-format reports, admin commands. Semantic caching stores an embedding of the query alongside the response, and retrieves cached responses for new queries above a cosine similarity threshold. It handles natural language variation — the same question phrased differently. It requires careful threshold tuning: too low and you return wrong answers, too high and your hit rate is negligible. In practice, exact match handles system-generated or templated queries, semantic caching handles end-user natural language queries.
Q: Your team added a new feature that does five LLM calls per user session. How do you evaluate the cost impact before shipping?
Before shipping, estimate the per-session token cost by running the five calls against a representative sample of user queries and measuring actual token counts. Multiply by expected daily active users and model pricing. Compare against the current per-session cost baseline. If the projected cost increase is above a threshold (typically 20% of current spend), the feature requires optimization before shipping — either by collapsing calls (can any be merged?), routing to cheaper models, adding caching, or deferring non-critical calls to batch processing. This cost gate should be part of the feature’s definition of done, not a post-launch afterthought.
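That cost gate can be expressed as a small check. A hedged sketch: the 20% default threshold and every number in the usage line are illustrative, not prescriptive.

```python
def feature_cost_gate(
    sample_session_costs_usd: list[float],  # measured cost of the new calls on sample sessions
    expected_dau: int,
    current_daily_spend_usd: float,
    max_increase_fraction: float = 0.20,
) -> bool:
    """Return True if the feature passes the pre-ship cost gate."""
    avg_session_cost = sum(sample_session_costs_usd) / len(sample_session_costs_usd)
    projected_daily_increase = avg_session_cost * expected_dau
    return projected_daily_increase <= max_increase_fraction * current_daily_spend_usd

# Five-call sessions averaging $0.05, 5,000 DAU, $1,000/day current spend:
# projects +$250/day against a $200 cap, so the gate fails
print(feature_cost_gate([0.04, 0.05, 0.06], 5_000, 1_000.0))  # → False
```

Wiring this into CI alongside quality evals makes "definition of done" enforceable rather than aspirational.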
Related
- LLM Caching — Semantic, KV, and prompt caching strategies
- LLMOps — CI/CD, eval gates, and production deployment
- LLM Evaluation — Measure quality alongside cost optimization
- Prompt Management — Track prompt changes that affect token usage
- Fine-Tuning Guide — Distill large models into cheaper alternatives
Frequently Asked Questions
How can I reduce LLM API costs?
The most effective strategies are: model routing (use cheaper models for simple tasks — saves 40-60%), prompt caching (cache repeated prompt prefixes — saves 70-90% on hits), prompt compression (remove redundant tokens — saves 20-40%), and batch processing (use batch APIs at 50% discount for non-urgent workloads). Combining these strategies can reduce costs by 80-95%.
What is model routing for cost optimization?
Model routing uses a classifier (often a small LLM or rule-based system) to determine the complexity of each request, then routes it to the most cost-effective model. Simple requests go to cheap models like GPT-4o-mini or Claude Haiku. Complex requests go to capable models like GPT-4o or Claude Sonnet. This typically reduces costs by 40-60% while maintaining quality on hard tasks.
What is semantic caching for LLMs?
Semantic caching stores LLM responses and retrieves cached answers for semantically similar (not just identical) queries. It uses embedding similarity to match incoming queries against cached ones. If a new query is similar enough (above a cosine similarity threshold), the cached response is returned instantly — zero API cost and sub-millisecond latency. Learn more in our LLM caching guide.
How much does it cost to run an LLM in production?
Costs vary enormously by model, volume, and optimization. A typical production app handling 10,000 requests/day with GPT-4o costs roughly $500-$1,500/month unoptimized. With model routing (60% to mini), caching (30% hit rate), and batch processing for async jobs, the same app costs $100-$300/month. LLM costs are proportional to tokens processed, and most optimizations reduce token count or model cost per token.
How does prompt compression reduce LLM costs?
Prompt compression strips unnecessary tokens from inputs without degrading output quality. LLMs respond equally well to compressed instruction formats — a 127-token system prompt can often be reduced to 42 tokens (67% fewer) with the same instructions. At scale across 10,000 daily requests, this saves roughly $775/year at GPT-4o pricing on a single system prompt alone.
When should I use batch APIs instead of real-time LLM calls?
Use batch APIs for workloads that are not time-sensitive: nightly report generation, classifying large document sets, email content generation, and RAG index enrichment. Both OpenAI and Anthropic offer batch APIs at a 50% price reduction with a 24-hour completion window. Reserve real-time API calls for user-facing requests where someone is waiting for a response.
What metrics should I track for LLM cost optimization?
Track six key metrics: cost per request broken down by model and feature, semantic cache hit rate as an actively improved KPI, model distribution showing percentage of requests at each tier, tokens per request (input and output separately), cost per active user per day for unit economics, and budget utilization rate per user and feature with alerts at 80% threshold.
How do I set per-user token budgets for LLM applications?
Implement per-user token budgets using a Redis-backed budget manager that tracks daily and monthly token consumption. Before each LLM call, check the estimated token count against the user's remaining budget. If the request would exceed the daily or monthly limit, reject it gracefully. This prevents runaway usage in multi-tenant applications.
What is the difference between exact match caching and semantic caching?
Exact match caching stores responses keyed by a hash of the full request — zero false positives, only returns cached responses for identical requests. It works for deterministic, templated queries. Semantic caching stores query embeddings and retrieves responses for semantically similar queries above a cosine similarity threshold, handling natural language variation. Use exact match for system-generated queries and semantic caching for end-user queries.
Why do LLM API costs spiral in production?
LLM costs compound across four factors: which model you call (GPT-4o costs roughly 17x more per token than GPT-4o-mini), how many input tokens you send (system prompt, retrieved context, conversation history), how many output tokens the model generates, and how often you call the API (retries, chain steps, agent loops). A RAG system doing three LLM calls per interaction can reach nearly $19,000/month at 10,000 requests per day.