
LLM Cost Optimization — Token Management, Caching & Model Routing (2026)

GenAI engineers who understand LLM fundamentals and can build working RAG systems often hit a wall in production: the bill. A prototype that costs pennies a day can cost thousands per month at scale. LLM cost optimization is the discipline of making those economics work — maintaining quality while cutting cost by 60-90%. This page gives you the mental models, architecture patterns, and code to do it.

Who this is for:

  • Backend and GenAI engineers shipping LLM-powered features to production who have received their first large API bill and need to fix it.
  • Engineering managers evaluating whether a GenAI feature is economically viable at scale.
  • Senior engineers preparing for interviews — cost optimization is a required topic for GenAI system design rounds at staff level and above.
  • Architects designing multi-model pipelines who need to reason about cost vs. capability tradeoffs at each layer.

LLM APIs price by token — both input tokens you send and output tokens the model generates. This creates a cost structure that surprises most engineers coming from traditional software.

In a conventional web service, adding a user or doubling traffic costs you compute and bandwidth. Both are cheap and predictable. In an LLM-powered service, the cost per request is determined by:

  1. Which model you call — GPT-4o costs roughly 17x more per token than GPT-4o-mini. Claude Sonnet costs roughly 4x more than Claude Haiku.
  2. How many input tokens you send — your system prompt, retrieved context, conversation history, and the user’s query all count.
  3. How many output tokens the model generates — longer, more detailed responses cost proportionally more.
  4. How often you call the API — every retry, every re-query, every chain step is a separate charge.

These factors compound. A RAG system that sends a 2,000-token system prompt, retrieves 3,000 tokens of context, and processes a 200-token user query sends 5,200 input tokens; an 800-token response brings the total to 6,000 tokens per call. At GPT-4o pricing (roughly $0.0025/1K input + $0.01/1K output), that is roughly $0.021 per request. At 10,000 requests per day, that is $210/day — roughly $6,300/month — from a single feature.
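That arithmetic is worth encoding as a helper so it can be rerun whenever prompts or prices change. A minimal sketch, counting only the 5,200 prompt tokens at the input rate (rates are the approximate figures used above; verify against current pricing):

```python
# Illustrative GPT-4o rates, USD per 1K tokens (assumption; check provider pricing)
GPT4O_INPUT_PER_1K = 0.0025
GPT4O_OUTPUT_PER_1K = 0.010

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single chat completion call."""
    return (
        input_tokens / 1000 * GPT4O_INPUT_PER_1K
        + output_tokens / 1000 * GPT4O_OUTPUT_PER_1K
    )

# 2,000 system prompt + 3,000 context + 200 query = 5,200 input tokens
per_request = request_cost_usd(input_tokens=5200, output_tokens=800)
per_month = per_request * 10_000 * 30  # 10K requests/day over 30 days
```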

Now imagine that feature does three LLM calls per user interaction (retrieval, generation, summarization): the bill triples. Add a multi-agent workflow where agents call each other, and costs can reach six figures monthly for moderately popular applications.

Understanding context windows is especially important here: larger context windows do not just change what the model can see — they change what you pay. Stuffing a 100K-token context through GPT-4o on every request will bankrupt a product quickly.

| Model | Input cost / 1M tokens | Output cost / 1M tokens | Relative cost |
| --- | --- | --- | --- |
| GPT-4o | $2.50 | $10.00 | Baseline |
| GPT-4o-mini | $0.15 | $0.60 | ~17x cheaper |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Similar to GPT-4o |
| Claude 3.5 Haiku | $0.80 | $4.00 | ~4x cheaper |
| Llama 3.1 70B (self-hosted) | $0.00 | $0.00 | Infra cost only |

Prices are approximate as of early 2026. Verify against official provider pricing pages before committing to an architecture.

The implication is direct: routing 70% of your requests to a model that is 17x cheaper — while preserving quality for the hard 30% — cuts total API spend by roughly 65% without users noticing.
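The claim is easy to sanity-check: the blended cost per request is just a weighted average of the two tiers. A quick check, not a production formula:

```python
def blended_savings(cheap_fraction: float, cheap_cost_ratio: float) -> float:
    """Fraction of spend saved when `cheap_fraction` of requests move to a
    model costing `cheap_cost_ratio` of the expensive model's price."""
    blended = (1 - cheap_fraction) * 1.0 + cheap_fraction * cheap_cost_ratio
    return 1 - blended

savings = blended_savings(cheap_fraction=0.70, cheap_cost_ratio=1 / 17)
# roughly 0.66, matching the "roughly 65%" figure above
```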


Model Routing — The Highest-Leverage Optimization


Model routing is the practice of classifying each incoming request by complexity, then directing it to the cheapest model capable of handling it. It is the single highest-leverage cost reduction technique because it addresses the root cause: paying expensive-model prices for tasks that do not require expensive-model capability.

Most production systems use three tiers:

  • Tier 1 (cheap): Simple classification, keyword extraction, format conversion, yes/no decisions. Models: GPT-4o-mini, Claude Haiku, Gemini Flash. Cost: $0.001-0.005 per call.
  • Tier 2 (mid): Multi-step reasoning, summarization, structured data extraction, code explanation. Models: GPT-4o, Claude Sonnet, Gemini Pro. Cost: $0.01-0.05 per call.
  • Tier 3 (expensive): Complex code generation, nuanced creative work, long-form analysis, tasks requiring world knowledge. Models: o1/o3, Claude Sonnet 3.7, Gemini Ultra. Cost: $0.05-0.50 per call.

The goal is to push as many requests as possible to Tier 1 while routing only genuinely hard tasks to Tier 3.

A complexity classifier is itself an LLM call — but a very cheap one. You use a fast, inexpensive model to categorize the incoming request, then route based on that categorization.

# Requires: openai>=1.30.0
from enum import Enum

from openai import OpenAI

client = OpenAI()

class ComplexityTier(str, Enum):
    SIMPLE = "simple"    # Tier 1: cheap model
    MEDIUM = "medium"    # Tier 2: mid model
    COMPLEX = "complex"  # Tier 3: expensive model

CLASSIFIER_PROMPT = """Classify the complexity of this user request. Reply with exactly one word:
- "simple" — format conversion, yes/no answer, keyword extraction, basic lookup
- "medium" — summarization, multi-step reasoning, code explanation, structured extraction
- "complex" — original code generation, nuanced analysis, multi-document synthesis, mathematical reasoning
User request: {query}"""

MODEL_MAP = {
    ComplexityTier.SIMPLE: "gpt-4o-mini",
    ComplexityTier.MEDIUM: "gpt-4o",
    ComplexityTier.COMPLEX: "o1-mini",
}

def classify_and_route(query: str, system_prompt: str, context: str = "") -> str:
    # Step 1: classify with the cheapest model
    classification_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(query=query)}],
        max_tokens=10,
        temperature=0,
    )
    tier_str = classification_response.choices[0].message.content.strip().lower()
    # Default to medium if the classifier output is unexpected
    try:
        tier = ComplexityTier(tier_str)
    except ValueError:
        tier = ComplexityTier.MEDIUM
    target_model = MODEL_MAP[tier]
    # Step 2: route to the selected model
    messages = [{"role": "system", "content": system_prompt}]
    if context:
        messages.append({"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"})
    else:
        messages.append({"role": "user", "content": query})
    response = client.chat.completions.create(
        model=target_model,
        messages=messages,
    )
    return response.choices[0].message.content

For predictable workloads, skip the classifier entirely and route by signal:

def rule_based_router(query: str, context_length: int, has_code: bool) -> str:
    # Long context always needs a capable model
    if context_length > 8000:
        return "gpt-4o"
    # Code generation is complex
    if has_code and any(kw in query.lower() for kw in ["generate", "write", "implement", "create"]):
        return "gpt-4o"
    # Short queries with no code are simple
    if len(query.split()) < 20 and not has_code:
        return "gpt-4o-mini"
    return "gpt-4o"  # default to the mid tier

Rule-based routing has zero latency overhead and zero additional cost. It works well when your request types are predictable — customer support bots, document processing pipelines, structured data extraction.


The diagram below shows how a production system routes requests through optimization layers before an expensive model call ever happens.

LLM Cost Optimization Pipeline

Each layer catches a request before it reaches a more expensive tier. Most requests should resolve before reaching Tier 2 or Tier 3 models.

  • Inbound Request (raw user query enters the system): query received, token count estimated, request fingerprint computed.
  • Cache Check (semantic + exact match, zero API cost on a hit): exact hash lookup, embedding similarity search; a cache hit returns instantly.
  • Model Router (classify complexity, select the cheapest capable model): rule-based signals, classifier using a Tier 1 model, model selected.
  • Prompt Optimizer (compress input before sending to the model): trim system prompt, compress context, remove filler tokens.
  • LLM Call (only the requests that survived all filters reach here): Tier 1 mini model, Tier 2 mid model, Tier 3 full model.
  • Response Store (cache the response for future reuse): store in semantic cache, log tokens + cost, update budget tracker.

This pipeline architecture means expensive model calls are the last resort, not the default. A well-tuned pipeline sees 30-50% of requests resolved by cache and 40-60% of remaining requests handled by Tier 1 models.
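The layering can be sketched as a single request handler. The components here are injected stubs standing in for the real cache, router, compressor, and model client (`llm_call` is a placeholder, not a real API), which keeps the control flow visible:

```python
from typing import Callable, Optional

def handle_request(
    query: str,
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str], None],
    route: Callable[[str], str],
    compress: Callable[[str], str],
    llm_call: Callable[[str, str], str],  # (model, prompt) -> response
) -> str:
    """Cache -> route -> compress -> call -> store, in that order."""
    cached = cache_get(query)
    if cached is not None:
        return cached               # zero API cost
    model = route(query)            # cheapest capable model
    prompt = compress(query)        # trim tokens before paying for them
    response = llm_call(model, prompt)
    cache_set(query, response)      # future identical requests hit the cache
    return response

# Wiring with in-memory stubs:
store: dict[str, str] = {}
answer = handle_request(
    "What is your return policy?",
    cache_get=store.get,
    cache_set=store.__setitem__,
    route=lambda q: "gpt-4o-mini" if len(q.split()) < 20 else "gpt-4o",
    compress=str.strip,
    llm_call=lambda model, prompt: f"[{model}] stub answer",
)
```

In production each stub becomes one of the components shown elsewhere on this page (semantic cache, complexity router, prompt compressor).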


Even after routing to the right model, you can significantly reduce per-call costs by reducing the number of tokens you send. Prompt compression is the practice of stripping unnecessary tokens from your inputs without degrading output quality.

System prompts are a common source of token waste. Engineers write system prompts in natural language for readability, but LLMs do not need grammatically correct prose. They respond equally well to compressed instruction formats.

Before (127 tokens):

You are a helpful customer support assistant for Acme Corp. Your job is to help customers
with their questions about our products and services. You should always be polite and
professional. If you don't know the answer to a question, you should say so and offer
to connect the customer with a human agent. Never make up information about products
or pricing. Always respond in the same language the customer uses.

After (42 tokens):

Role: Acme Corp support agent.
Rules: Polite, professional. Match user language.
If unknown: say so, offer human escalation.
Never fabricate product/pricing info.

Same instructions, 67% fewer tokens. A system prompt running at 127 tokens across 10,000 daily requests costs 1.27M input tokens per day; at 42 tokens, that is 420K. The savings at GPT-4o pricing: roughly $2.13/day — about $775/year — from compressing a single system prompt.
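The savings scale linearly with request volume, so a two-line helper makes the arithmetic repeatable (the rate is the approximate GPT-4o input price used above):

```python
INPUT_PRICE_PER_1M = 2.50  # approximate GPT-4o input rate, USD per 1M tokens

def daily_prompt_savings(tokens_before: int, tokens_after: int, requests_per_day: int) -> float:
    """USD saved per day by shrinking a prompt sent on every request."""
    saved_tokens = (tokens_before - tokens_after) * requests_per_day
    return saved_tokens / 1_000_000 * INPUT_PRICE_PER_1M

per_day = daily_prompt_savings(127, 42, 10_000)
per_year = per_day * 365
```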

In RAG systems, retrieved context is frequently the largest token consumer. Standard RAG retrieves top-k chunks regardless of actual relevance. Dynamic pruning removes chunks below a relevance threshold before they enter the prompt.

# Requires: python>=3.9
from typing import List, Tuple

def prune_retrieved_chunks(
    chunks: List[Tuple[str, float]],  # (text, similarity_score)
    min_similarity: float = 0.75,
    max_tokens: int = 3000,
    approx_tokens_per_word: float = 1.3,
) -> str:
    """Filter chunks by relevance and token budget, return joined context."""
    filtered = [text for text, score in chunks if score >= min_similarity]
    # Build context within the token budget
    context_parts = []
    token_count = 0
    for chunk in filtered:
        chunk_tokens = int(len(chunk.split()) * approx_tokens_per_word)
        if token_count + chunk_tokens > max_tokens:
            break
        context_parts.append(chunk)
        token_count += chunk_tokens
    return "\n\n".join(context_parts)

Agents and chatbots accumulate conversation history that grows without bound. Strategies to manage it:

  • Sliding window: Keep only the last N turns. Simple, predictable token cost.
  • Summarization compression: Periodically summarize older turns into a compact summary, replace them with the summary. Preserves context without accumulating raw tokens.
  • Importance scoring: Score each message by relevance to the current query, drop low-scoring old messages.

def truncate_history_sliding_window(
    messages: list,
    max_turns: int = 10,
    always_keep_system: bool = True,
) -> list:
    """Keep system messages plus the last N user/assistant turn pairs."""
    system_messages = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    # Keep the last max_turns * 2 messages (each turn = user + assistant)
    truncated_conversation = conversation[-(max_turns * 2):]
    if always_keep_system:
        return system_messages + truncated_conversation
    return truncated_conversation
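The summarization-compression strategy can be sketched the same way. The `summarize` callable below is a stand-in for a cheap-model call (for example, GPT-4o-mini prompted to condense the old turns); injecting it as a parameter keeps the sketch testable without an API key:

```python
from typing import Callable

def compress_history(
    messages: list[dict],
    summarize: Callable[[str], str],  # cheap-model call in production
    keep_recent_turns: int = 4,
) -> list[dict]:
    """Replace old turns with a single summary message; keep recent turns verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    convo = [m for m in messages if m["role"] != "system"]
    recent = convo[-(keep_recent_turns * 2):]
    old = convo[:len(convo) - len(recent)]
    if not old:
        return messages  # nothing to compress yet
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(transcript)}",
    }
    return system + [summary] + recent
```

Run this whenever the history crosses a token threshold; each pass folds all older turns into one summary message, so token usage stays bounded.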

Caching is the highest-efficiency optimization available — a cache hit costs nothing and returns instantly. Three distinct caching patterns apply to LLM workloads.

For deterministic, repeated queries — API documentation lookups, FAQ answers, status checks — exact string matching is sufficient. Store the query string as a hash key, the response as the value.

# Requires: redis>=5.0.0
import hashlib
import json

import redis

class ExactMatchCache:
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 3600):
        self.cache = redis_client
        self.ttl = ttl_seconds

    def _make_key(self, model: str, messages: list, temperature: float) -> str:
        payload = json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return f"llm:exact:{hashlib.sha256(payload.encode()).hexdigest()}"

    def get(self, model: str, messages: list, temperature: float = 0.0) -> str | None:
        key = self._make_key(model, messages, temperature)
        cached = self.cache.get(key)
        return cached.decode() if cached else None

    def set(self, model: str, messages: list, response: str, temperature: float = 0.0):
        key = self._make_key(model, messages, temperature)
        self.cache.setex(key, self.ttl, response)

Exact match works best with temperature=0 calls. Caching sampled outputs (temperature > 0) pins every identical request to a single stored response, which defeats the variability you asked for — use it only when the response is meant to be stable.

Semantic caching matches queries by meaning, not exact text. A user asking “What is your return policy?” and another asking “Can I return something I bought?” should get the same cached answer.

The pattern uses embedding similarity:

# Requires: openai>=1.30.0, numpy>=1.26.0
import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92):
        self.entries: list[dict] = []  # In production: use a vector DB
        self.threshold = similarity_threshold

    def _embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a_arr, b_arr = np.array(a), np.array(b)
        return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

    def get(self, query: str) -> str | None:
        query_embedding = self._embed(query)
        for entry in self.entries:
            similarity = self._cosine_similarity(query_embedding, entry["embedding"])
            if similarity >= self.threshold:
                return entry["response"]
        return None

    def set(self, query: str, response: str):
        embedding = self._embed(query)
        self.entries.append({"query": query, "embedding": embedding, "response": response})

In production, replace the in-memory list with a vector database (Pinecone, Weaviate, pgvector) for scalable similarity search. GPTCache is a popular open-source library that implements this pattern with multiple backend options.

Threshold selection matters: A similarity threshold of 0.95+ is conservative (few false positives, lower hit rate). A threshold of 0.85 is aggressive (higher hit rate, occasional incorrect cache hits). Test on your actual query distribution before choosing.
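Threshold selection can be made empirical rather than guessed. Given labeled pairs of (similarity score, whether the cached answer was actually correct for the new query) from offline review, a sweep picks the lowest threshold whose accepted hits stay precise enough. A sketch, assuming you have such labels:

```python
def pick_threshold(
    labeled: list[tuple[float, bool]],  # (similarity, cached answer was correct)
    candidates: list[float],
    min_precision: float = 0.98,
) -> float:
    """Lowest threshold (i.e. highest hit rate) meeting the precision floor."""
    best = max(candidates)  # fall back to the most conservative candidate
    for threshold in sorted(candidates, reverse=True):
        hits = [correct for sim, correct in labeled if sim >= threshold]
        if not hits:
            continue
        precision = sum(hits) / len(hits)
        if precision >= min_precision:
            best = threshold  # keep lowering while precision holds
        else:
            break
    return best
```

Rerun the sweep periodically: the right threshold shifts as your query distribution changes.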

Several providers now offer built-in prompt caching that caches the KV (key-value) computation for repeated prompt prefixes:

  • Anthropic Prompt Caching: Cache breakpoints in system prompts. Cached tokens cost 90% less. Cache lifetime: 5 minutes, extendable.
  • OpenAI Prompt Caching: Automatic for prompts over 1,024 tokens. Cached input tokens cost 50% less.

For OpenAI this is the easiest caching win — the discount applies automatically whenever your prompts share a long common prefix, with zero application code changes. Anthropic requires you to mark the cacheable prefix explicitly:

# Requires: anthropic>=0.40.0
# Anthropic prompt caching — explicit cache breakpoint
import anthropic

anthropic_client = anthropic.Anthropic()

response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",
        },
        {
            "type": "text",
            "text": "<entire_product_catalog_here>",  # Large static context
            "cache_control": {"type": "ephemeral"},   # Cache this prefix
        },
    ],
    messages=[{"role": "user", "content": "What are your shipping options?"}],
)

The cache_control breakpoint tells Anthropic to cache everything up to that point. Subsequent requests sharing the same prefix pay 90% less for those tokens.
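Whichever provider you use, it is worth logging how much of each request actually hit the provider-side cache. The helper below reads the cached-token count from a usage payload shaped like OpenAI's (`prompt_tokens_details.cached_tokens`); field names differ per provider, so treat the shape as an assumption to adapt:

```python
def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens served from the provider-side cache.

    Assumes an OpenAI-style usage payload; adapt field names per provider.
    """
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0

usage = {"prompt_tokens": 4000, "prompt_tokens_details": {"cached_tokens": 3000}}
# 3,000 of 4,000 prompt tokens came from cache -> fraction 0.75
```

A falling cached fraction is an early warning that a prompt edit broke the shared prefix.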


Not all LLM calls are time-sensitive. Background summarization, report generation, content classification, and data enrichment jobs can wait minutes or hours. Batch APIs exist specifically for these workloads — at a 50% price reduction.

| Pattern | Real-time API | Batch API |
| --- | --- | --- |
| User waiting for response | Required | Not suitable |
| Nightly report generation | Wasteful | Ideal |
| Classifying 100K documents | Slow + expensive | Fast + 50% cheaper |
| Email content generation | Depends on deadline | Usually suitable |
| RAG index enrichment | Not needed | Ideal |

# Requires: openai>=1.30.0
import json
import time

from openai import OpenAI

client = OpenAI()

def submit_batch_job(requests: list[dict]) -> str:
    """Submit a batch of LLM requests. Returns batch_id for status polling."""
    # Build the JSONL payload, one request per line
    jsonl_content = "\n".join(
        json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": req["messages"],
                "max_tokens": req.get("max_tokens", 500),
            },
        })
        for i, req in enumerate(requests)
    )
    # Upload the input file
    input_file = client.files.create(
        file=("batch_input.jsonl", jsonl_content.encode(), "application/jsonl"),
        purpose="batch",
    )
    # Create the batch
    batch = client.batches.create(
        input_file_id=input_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return batch.id

def retrieve_batch_results(batch_id: str) -> list[dict]:
    """Poll for completion and return results."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            break
        elif batch.status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"Batch {batch.status}: {batch.errors}")
        time.sleep(60)  # Poll every minute
    output_file = client.files.content(batch.output_file_id)
    results = []
    for line in output_file.text.strip().split("\n"):
        result = json.loads(line)
        results.append({
            "id": result["custom_id"],
            "content": result["response"]["body"]["choices"][0]["message"]["content"],
        })
    return results

Production systems often need both real-time and batch paths. A queue-based architecture handles both:

User Request
├─ is_urgent? → Real-time API (full price)
└─ not urgent → Job Queue → Batch Processor → Result Store
(runs every 4h, 50% cost)

Use a priority queue (Redis, SQS) where urgent jobs bypass the batch accumulator. Non-urgent jobs accumulate until a threshold (time elapsed or batch size) triggers a batch submission.
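A sketch of that accumulator logic: urgent jobs dispatch immediately, everything else buffers until either a size threshold or an age threshold trips. The thresholds and the in-memory structure are illustrative; in production the buffer would live in Redis or SQS, and `realtime_call`/`batch_submit` would wrap the real APIs:

```python
import time

class BatchAccumulator:
    def __init__(self, max_batch_size: int = 100, max_wait_seconds: float = 4 * 3600):
        self.pending: list[dict] = []
        self.oldest_enqueued_at: float | None = None
        self.max_batch_size = max_batch_size
        self.max_wait_seconds = max_wait_seconds

    def add(self, job: dict) -> None:
        if self.oldest_enqueued_at is None:
            self.oldest_enqueued_at = time.monotonic()
        self.pending.append(job)

    def ready(self) -> bool:
        if not self.pending:
            return False
        if len(self.pending) >= self.max_batch_size:
            return True  # size threshold tripped
        return time.monotonic() - self.oldest_enqueued_at >= self.max_wait_seconds

    def drain(self) -> list[dict]:
        batch, self.pending = self.pending, []
        self.oldest_enqueued_at = None
        return batch

def dispatch(job: dict, accumulator: BatchAccumulator, realtime_call, batch_submit) -> None:
    if job.get("urgent"):
        realtime_call(job)       # full price, immediate
        return
    accumulator.add(job)         # 50% price, deferred
    if accumulator.ready():
        batch_submit(accumulator.drain())
```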


Optimization strategies only work if you can measure their effect. Cost monitoring closes the loop — without it, you are flying blind and cannot know whether your optimizations are working or when costs are trending toward overage.

Every LLM call should log token counts and estimated cost:

# Requires: python>=3.10
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LLMCallMetrics:
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float
    latency_ms: float
    user_id: str | None
    feature: str
    timestamp: datetime

# Pricing table, USD per 1K tokens (update when providers change rates)
PRICING = {
    "gpt-4o": {"input": 0.0025, "output": 0.010, "cached_input": 0.00125},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006, "cached_input": 0.000075},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015, "cached_input": 0.0003},
    "claude-3-5-haiku-20241022": {"input": 0.0008, "output": 0.004, "cached_input": 0.00008},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    pricing = PRICING.get(model, PRICING["gpt-4o"])  # default to a conservative estimate
    regular_input = input_tokens - cached_tokens
    cost = (
        regular_input / 1000 * pricing["input"]
        + cached_tokens / 1000 * pricing.get("cached_input", pricing["input"] * 0.1)
        + output_tokens / 1000 * pricing["output"]
    )
    return cost

For multi-tenant applications, set per-user token budgets to prevent runaway usage:

# Requires: redis>=5.0.0
from datetime import date

import redis

class TokenBudgetManager:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_and_consume(
        self,
        user_id: str,
        estimated_tokens: int,
        daily_limit: int = 100_000,
        monthly_limit: int = 2_000_000,
    ) -> bool:
        """Return True if under budget, False if a limit would be exceeded."""
        day_key = f"budget:{user_id}:day:{date.today().isoformat()}"
        month_key = f"budget:{user_id}:month:{date.today().strftime('%Y-%m')}"
        daily_used = int(self.redis.get(day_key) or 0)
        monthly_used = int(self.redis.get(month_key) or 0)
        if daily_used + estimated_tokens > daily_limit:
            return False
        if monthly_used + estimated_tokens > monthly_limit:
            return False
        # Consume the budget (note: this check-then-increment is not atomic;
        # use a Lua script if concurrent requests per user are common)
        pipe = self.redis.pipeline()
        pipe.incrby(day_key, estimated_tokens)
        pipe.expire(day_key, 86400)  # 1-day TTL
        pipe.incrby(month_key, estimated_tokens)
        pipe.expire(month_key, 2_592_000)  # 30-day TTL
        pipe.execute()
        return True

Align your monitoring with LLMOps evaluation and production observability practices:

  • Cost per request — broken down by model, feature, and user segment
  • Cache hit rate — semantic cache hit rate should be a KPI you actively improve
  • Model distribution — percentage of requests routed to each tier
  • Tokens per request — input and output separately; rising averages signal prompt drift
  • Cost per active user per day — the metric that determines unit economics
  • Budget utilization rate — per-user and per-feature, alerting at 80% threshold

A weekly cost review comparing these metrics against a baseline is the practice that catches the regressions that erode optimizations over time — prompt edits that silently lengthen system prompts, new features that default to expensive models, increasing conversation history depths.
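That weekly review can be partially automated: compare the week's metrics against the baseline and flag anything that moved more than a tolerance in the costly direction. A sketch with illustrative metric names:

```python
# For each metric: +1 means "higher is worse", -1 means "lower is worse"
DIRECTION = {
    "cost_per_request": +1,
    "tokens_per_request": +1,
    "cache_hit_rate": -1,
    "tier1_share": -1,
}

def cost_regressions(current: dict, baseline: dict, tolerance: float = 0.10) -> list[str]:
    """Metrics that moved more than `tolerance` (relative) in the bad direction."""
    flagged = []
    for metric, direction in DIRECTION.items():
        if metric not in current or metric not in baseline or not baseline[metric]:
            continue
        relative_change = (current[metric] - baseline[metric]) / baseline[metric]
        if direction * relative_change > tolerance:
            flagged.append(metric)
    return flagged
```

Wire the output into whatever alerting you already use; the point is that a silently lengthened prompt or a dropped cache hit rate shows up within a week, not on the next invoice.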


LLM cost optimization is a required topic for senior and staff GenAI engineering interviews. Interviewers want to see that you understand the economics of production systems, not just how to build demos. These questions appear frequently in GenAI system design rounds.

Q: A customer support chatbot is costing $50,000/month in API fees. The team cannot cut features or reduce users. What is your optimization plan?

A structured answer covers four parallel tracks. First, instrument the system to measure where tokens are going — system prompt size, retrieved context size, conversation history length, output token counts — broken down by request type. Second, implement model routing: classify incoming requests and route simple queries (FAQ lookups, status checks, yes/no decisions) to a Tier 1 model like GPT-4o-mini. Expect 40-60% of requests to qualify. Third, add semantic caching using an embedding similarity search against previous successful responses. FAQ-style questions typically show 30-50% cache hit rates in support contexts. Fourth, compress the system prompt and prune retrieved context to token budgets. Together, these typically reduce spend to $10,000-$20,000/month — a 60-80% reduction with no user-facing quality change.

Q: How do you decide which model to use for a given request in a production system?

The decision framework has three components. A rule-based pre-filter routes obviously simple or obviously complex requests without needing a classifier. A classifier (itself a cheap model call) handles the ambiguous middle. An escape hatch lets users or features explicitly request a higher-tier model when needed. The classifier uses signals like query length, presence of code, reasoning complexity keywords, and whether structured multi-step output is required. The routing decision is logged alongside the response, enabling offline evaluation of routing accuracy — if cheap-model responses are rated poorly by users, the classifier thresholds need adjustment.

Q: What is the difference between exact match caching and semantic caching, and when would you use each?

Exact match caching stores responses keyed by a hash of the full request (model + messages + temperature). It has zero false positives — it only returns a cached response when the request is identical. It works well for deterministic queries: API documentation lookups, fixed-format reports, admin commands. Semantic caching stores an embedding of the query alongside the response, and retrieves cached responses for new queries above a cosine similarity threshold. It handles natural language variation — the same question phrased differently. It requires careful threshold tuning: too low and you return wrong answers, too high and your hit rate is negligible. In practice, exact match handles system-generated or templated queries, semantic caching handles end-user natural language queries.

Q: Your team added a new feature that does five LLM calls per user session. How do you evaluate the cost impact before shipping?

Before shipping, estimate the per-session token cost by running the five calls against a representative sample of user queries and measuring actual token counts. Multiply by expected daily active users and model pricing. Compare against the current per-session cost baseline. If the projected cost increase is above a threshold (typically 20% of current spend), the feature requires optimization before shipping — either by collapsing calls (can any be merged?), routing to cheaper models, adding caching, or deferring non-critical calls to batch processing. This cost gate should be part of the feature’s definition of done, not an afterthought after launch.


Frequently Asked Questions

How can I reduce LLM API costs?

The most effective strategies are: model routing (use cheaper models for simple tasks — saves 40-60%), prompt caching (cache repeated prompt prefixes — saves 70-90% on hits), prompt compression (remove redundant tokens — saves 20-40%), and batch processing (use batch APIs at 50% discount for non-urgent workloads). Combining these strategies can reduce costs by 80-95%.

What is model routing for cost optimization?

Model routing uses a classifier (often a small LLM or rule-based system) to determine the complexity of each request, then routes it to the most cost-effective model. Simple requests go to cheap models like GPT-4o-mini or Claude Haiku. Complex requests go to capable models like GPT-4o or Claude Sonnet. This typically reduces costs by 40-60% while maintaining quality on hard tasks.

What is semantic caching for LLMs?

Semantic caching stores LLM responses and retrieves cached answers for semantically similar (not just identical) queries. It uses embedding similarity to match incoming queries against cached ones. If a new query is similar enough (above a cosine similarity threshold), the cached response is returned instantly — zero API cost and sub-millisecond latency. Learn more in our LLM caching guide.

How much does it cost to run an LLM in production?

Costs vary enormously by model, volume, and optimization. A typical production app handling 10,000 requests/day with GPT-4o costs roughly $500-$1,500/month unoptimized. With model routing (60% to mini), caching (30% hit rate), and batch processing for async jobs, the same app costs $100-$300/month. LLM costs are proportional to tokens processed, and most optimizations reduce token count or model cost per token.

How does prompt compression reduce LLM costs?

Prompt compression strips unnecessary tokens from inputs without degrading output quality. LLMs respond equally well to compressed instruction formats — a 127-token system prompt can often be reduced to 42 tokens (67% fewer) with the same instructions. At scale across 10,000 daily requests, this saves roughly $775/year at GPT-4o pricing on a single system prompt alone.

When should I use batch APIs instead of real-time LLM calls?

Use batch APIs for workloads that are not time-sensitive: nightly report generation, classifying large document sets, email content generation, and RAG index enrichment. Both OpenAI and Anthropic offer batch APIs at a 50% price reduction with a 24-hour completion window. Reserve real-time API calls for user-facing requests where someone is waiting for a response.

What metrics should I track for LLM cost optimization?

Track six key metrics: cost per request broken down by model and feature, semantic cache hit rate as an actively improved KPI, model distribution showing percentage of requests at each tier, tokens per request (input and output separately), cost per active user per day for unit economics, and budget utilization rate per user and feature with alerts at 80% threshold.

How do I set per-user token budgets for LLM applications?

Implement per-user token budgets using a Redis-backed budget manager that tracks daily and monthly token consumption. Before each LLM call, check the estimated token count against the user's remaining budget. If the request would exceed the daily or monthly limit, reject it gracefully. This prevents runaway usage in multi-tenant applications.

What is the difference between exact match caching and semantic caching?

Exact match caching stores responses keyed by a hash of the full request — zero false positives, only returns cached responses for identical requests. It works for deterministic, templated queries. Semantic caching stores query embeddings and retrieves responses for semantically similar queries above a cosine similarity threshold, handling natural language variation. Use exact match for system-generated queries and semantic caching for end-user queries.

Why do LLM API costs spiral in production?

LLM costs compound across four factors: which model you call (GPT-4o costs roughly 30-50x more per token than GPT-4o-mini), how many input tokens you send (system prompt, retrieved context, conversation history), how many output tokens the model generates, and how often you call the API (retries, chain steps, agent loops). A RAG system doing three LLM calls per interaction can reach over $20,000/month at 10,000 requests per day.