
LLM Cost Optimization — Token Management, Caching & Model Routing (2026)

GenAI engineers who understand LLM fundamentals and can build working RAG systems often hit a wall in production: the bill. A prototype that costs pennies a day can cost thousands per month at scale. LLM cost optimization is the discipline of making those economics work — maintaining quality while cutting cost by 60-90%. This page gives you the mental models, architecture patterns, and code to do it.

Who this is for:

  • Backend and GenAI engineers shipping LLM-powered features to production who have received their first large API bill and need to fix it.
  • Engineering managers evaluating whether a GenAI feature is economically viable at scale.
  • Senior engineers preparing for interviews — cost optimization is a required topic for GenAI system design rounds at staff level and above.
  • Architects designing multi-model pipelines who need to reason about cost vs. capability tradeoffs at each layer.

LLM APIs price by token — both input tokens you send and output tokens the model generates. This creates a cost structure that surprises most engineers coming from traditional software.

In a conventional web service, adding a user or doubling traffic costs you compute and bandwidth. Both are cheap and predictable. In an LLM-powered service, the cost per request is determined by:

  1. Which model you call — GPT-4o costs roughly 17x more per token than GPT-4o-mini. Claude Sonnet costs roughly 4x more than Claude Haiku.
  2. How many input tokens you send — your system prompt, retrieved context, conversation history, and the user’s query all count.
  3. How many output tokens the model generates — longer, more detailed responses cost proportionally more.
  4. How often you call the API — every retry, every re-query, every chain step is a separate charge.

These factors compound. A RAG system that sends a 2,000-token system prompt, retrieves 3,000 tokens of context, and processes a 200-token user query sends 5,200 input tokens; an 800-token response brings the total to 6,000 tokens per call. At GPT-4o pricing (roughly $0.0025/1K input + $0.01/1K output), that is roughly $0.021 per request. At 10,000 requests per day, that is $210/day — roughly $6,300/month — from a single feature.
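That arithmetic is worth encoding as a helper so it can be rerun whenever prompts or prices change. A minimal sketch, counting only the 5,200 prompt tokens at the input rate (rates are the approximate figures used above; verify against current pricing):

```python
# Illustrative GPT-4o rates, USD per 1K tokens (assumption; check provider pricing)
GPT4O_INPUT_PER_1K = 0.0025
GPT4O_OUTPUT_PER_1K = 0.010

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single chat completion call."""
    return (
        input_tokens / 1000 * GPT4O_INPUT_PER_1K
        + output_tokens / 1000 * GPT4O_OUTPUT_PER_1K
    )

# 2,000 system prompt + 3,000 context + 200 query = 5,200 input tokens
per_request = request_cost_usd(input_tokens=5200, output_tokens=800)
per_month = per_request * 10_000 * 30  # 10K requests/day over 30 days
```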

Now imagine that feature does three LLM calls per user interaction (retrieval, generation, summarization): the bill triples. Add a multi-agent workflow where agents call each other, and costs can reach six figures monthly for moderately popular applications.

Understanding context windows is especially important here: larger context windows do not just change what the model can see — they change what you pay. Stuffing a 100K-token context through GPT-4o on every request will bankrupt a product quickly.

| Model | Input cost / 1M tokens | Output cost / 1M tokens | Relative cost |
| --- | --- | --- | --- |
| GPT-4o | $2.50 | $10.00 | Baseline |
| GPT-4o-mini | $0.15 | $0.60 | ~17x cheaper |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Similar to GPT-4o |
| Claude 3.5 Haiku | $0.80 | $4.00 | ~4x cheaper |
| Llama 3.1 70B (self-hosted) | $0.00 | $0.00 | Infra cost only |

Prices are approximate as of early 2026. Verify against official provider pricing pages before committing to an architecture.

The implication is direct: routing 70% of your requests to a model that is 17x cheaper — while preserving quality for the hard 30% — cuts total API spend by roughly 65% without users noticing.
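The claim is easy to sanity-check: the blended cost per request is just a weighted average of the two tiers. A quick check, not a production formula:

```python
def blended_savings(cheap_fraction: float, cheap_cost_ratio: float) -> float:
    """Fraction of spend saved when `cheap_fraction` of requests move to a
    model costing `cheap_cost_ratio` of the expensive model's price."""
    blended = (1 - cheap_fraction) * 1.0 + cheap_fraction * cheap_cost_ratio
    return 1 - blended

savings = blended_savings(cheap_fraction=0.70, cheap_cost_ratio=1 / 17)
# roughly 0.66, matching the "roughly 65%" figure above
```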


Model Routing — The Highest-Leverage Optimization


Model routing is the practice of classifying each incoming request by complexity, then directing it to the cheapest model capable of handling it. It is the single highest-leverage cost reduction technique because it addresses the root cause: paying expensive-model prices for tasks that do not require expensive-model capability.

Most production systems use three tiers:

  • Tier 1 (cheap): Simple classification, keyword extraction, format conversion, yes/no decisions. Models: GPT-4o-mini, Claude Haiku, Gemini Flash. Cost: $0.001-0.005 per call.
  • Tier 2 (mid): Multi-step reasoning, summarization, structured data extraction, code explanation. Models: GPT-4o, Claude Sonnet, Gemini Pro. Cost: $0.01-0.05 per call.
  • Tier 3 (expensive): Complex code generation, nuanced creative work, long-form analysis, tasks requiring world knowledge. Models: o1/o3, Claude Sonnet 3.7, Gemini Ultra. Cost: $0.05-0.50 per call.

The goal is to push as many requests as possible to Tier 1 while routing only genuinely hard tasks to Tier 3.

A complexity classifier is itself an LLM call — but a very cheap one. You use a fast, inexpensive model to categorize the incoming request, then route based on that categorization.

# Requires: openai>=1.30.0
from enum import Enum

from openai import OpenAI

client = OpenAI()

class ComplexityTier(str, Enum):
    SIMPLE = "simple"    # Tier 1: cheap model
    MEDIUM = "medium"    # Tier 2: mid model
    COMPLEX = "complex"  # Tier 3: expensive model

CLASSIFIER_PROMPT = """Classify the complexity of this user request. Reply with exactly one word:
- "simple" — format conversion, yes/no answer, keyword extraction, basic lookup
- "medium" — summarization, multi-step reasoning, code explanation, structured extraction
- "complex" — original code generation, nuanced analysis, multi-document synthesis, mathematical reasoning
User request: {query}"""

MODEL_MAP = {
    ComplexityTier.SIMPLE: "gpt-4o-mini",
    ComplexityTier.MEDIUM: "gpt-4o",
    ComplexityTier.COMPLEX: "o1-mini",
}

def classify_and_route(query: str, system_prompt: str, context: str = "") -> str:
    # Step 1: classify with the cheapest model
    classification_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(query=query)}],
        max_tokens=10,
        temperature=0,
    )
    tier_str = classification_response.choices[0].message.content.strip().lower()
    # Default to medium if the classifier output is unexpected
    try:
        tier = ComplexityTier(tier_str)
    except ValueError:
        tier = ComplexityTier.MEDIUM
    target_model = MODEL_MAP[tier]
    # Step 2: route to the selected model
    messages = [{"role": "system", "content": system_prompt}]
    if context:
        messages.append({"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"})
    else:
        messages.append({"role": "user", "content": query})
    response = client.chat.completions.create(
        model=target_model,
        messages=messages,
    )
    return response.choices[0].message.content

For predictable workloads, skip the classifier entirely and route by signal:

def rule_based_router(query: str, context_length: int, has_code: bool) -> str:
    # Long context always needs a capable model
    if context_length > 8000:
        return "gpt-4o"
    # Code generation is complex
    if has_code and any(kw in query.lower() for kw in ["generate", "write", "implement", "create"]):
        return "gpt-4o"
    # Short queries with no code are simple
    if len(query.split()) < 20 and not has_code:
        return "gpt-4o-mini"
    return "gpt-4o"  # default to the mid tier

Rule-based routing has zero latency overhead and zero additional cost. It works well when your request types are predictable — customer support bots, document processing pipelines, structured data extraction.


The diagram below shows how a production system routes requests through optimization layers before an expensive model call ever happens.

LLM Cost Optimization Pipeline

Each layer catches a request before it reaches a more expensive tier. Most requests should resolve before reaching Tier 2 or Tier 3 models.

  • Inbound Request (raw user query enters the system): query received, token count estimated, request fingerprint computed.
  • Cache Check (semantic + exact match, zero API cost on a hit): exact hash lookup, embedding similarity search; a cache hit returns instantly.
  • Model Router (classify complexity, select the cheapest capable model): rule-based signals, classifier using a Tier 1 model, model selected.
  • Prompt Optimizer (compress input before sending to the model): trim system prompt, compress context, remove filler tokens.
  • LLM Call (only the requests that survived all filters reach here): Tier 1 mini model, Tier 2 mid model, Tier 3 full model.
  • Response Store (cache the response for future reuse): store in semantic cache, log tokens + cost, update budget tracker.

This pipeline architecture means expensive model calls are the last resort, not the default. A well-tuned pipeline sees 30-50% of requests resolved by cache and 40-60% of remaining requests handled by Tier 1 models.
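The layering can be sketched as a single request handler. The components here are injected stubs standing in for the real cache, router, compressor, and model client (`llm_call` is a placeholder, not a real API), which keeps the control flow visible:

```python
from typing import Callable, Optional

def handle_request(
    query: str,
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str], None],
    route: Callable[[str], str],
    compress: Callable[[str], str],
    llm_call: Callable[[str, str], str],  # (model, prompt) -> response
) -> str:
    """Cache -> route -> compress -> call -> store, in that order."""
    cached = cache_get(query)
    if cached is not None:
        return cached               # zero API cost
    model = route(query)            # cheapest capable model
    prompt = compress(query)        # trim tokens before paying for them
    response = llm_call(model, prompt)
    cache_set(query, response)      # future identical requests hit the cache
    return response

# Wiring with in-memory stubs:
store: dict[str, str] = {}
answer = handle_request(
    "What is your return policy?",
    cache_get=store.get,
    cache_set=store.__setitem__,
    route=lambda q: "gpt-4o-mini" if len(q.split()) < 20 else "gpt-4o",
    compress=str.strip,
    llm_call=lambda model, prompt: f"[{model}] stub answer",
)
```

In production each stub becomes one of the components shown elsewhere on this page (semantic cache, complexity router, prompt compressor).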


Even after routing to the right model, you can significantly reduce per-call costs by reducing the number of tokens you send. Prompt compression is the practice of stripping unnecessary tokens from your inputs without degrading output quality.

System prompts are a common source of token waste. Engineers write system prompts in natural language for readability, but LLMs do not need grammatically correct prose. They respond equally well to compressed instruction formats.

Before (127 tokens):

You are a helpful customer support assistant for Acme Corp. Your job is to help customers
with their questions about our products and services. You should always be polite and
professional. If you don't know the answer to a question, you should say so and offer
to connect the customer with a human agent. Never make up information about products
or pricing. Always respond in the same language the customer uses.

After (42 tokens):

Role: Acme Corp support agent.
Rules: Polite, professional. Match user language.
If unknown: say so, offer human escalation.
Never fabricate product/pricing info.

Same instructions, 67% fewer tokens. A system prompt running at 127 tokens across 10,000 daily requests costs 1.27M input tokens per day; at 42 tokens, that is 420K. The savings at GPT-4o pricing: roughly $2.13/day — about $775/year — from compressing a single system prompt.
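The savings scale linearly with request volume, so a two-line helper makes the arithmetic repeatable (the rate is the approximate GPT-4o input price used above):

```python
INPUT_PRICE_PER_1M = 2.50  # approximate GPT-4o input rate, USD per 1M tokens

def daily_prompt_savings(tokens_before: int, tokens_after: int, requests_per_day: int) -> float:
    """USD saved per day by shrinking a prompt sent on every request."""
    saved_tokens = (tokens_before - tokens_after) * requests_per_day
    return saved_tokens / 1_000_000 * INPUT_PRICE_PER_1M

per_day = daily_prompt_savings(127, 42, 10_000)
per_year = per_day * 365
```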

In RAG systems, retrieved context is frequently the largest token consumer. Standard RAG retrieves top-k chunks regardless of actual relevance. Dynamic pruning removes chunks below a relevance threshold before they enter the prompt.

# Requires: python>=3.9
from typing import List, Tuple

def prune_retrieved_chunks(
    chunks: List[Tuple[str, float]],  # (text, similarity_score)
    min_similarity: float = 0.75,
    max_tokens: int = 3000,
    approx_tokens_per_word: float = 1.3,
) -> str:
    """Filter chunks by relevance and token budget, return joined context."""
    filtered = [text for text, score in chunks if score >= min_similarity]
    # Build context within the token budget
    context_parts = []
    token_count = 0
    for chunk in filtered:
        chunk_tokens = int(len(chunk.split()) * approx_tokens_per_word)
        if token_count + chunk_tokens > max_tokens:
            break
        context_parts.append(chunk)
        token_count += chunk_tokens
    return "\n\n".join(context_parts)

Agents and chatbots accumulate conversation history that grows without bound. Strategies to manage it:

  • Sliding window: Keep only the last N turns. Simple, predictable token cost.
  • Summarization compression: Periodically summarize older turns into a compact summary, replace them with the summary. Preserves context without accumulating raw tokens.
  • Importance scoring: Score each message by relevance to the current query, drop low-scoring old messages.

def truncate_history_sliding_window(
    messages: list,
    max_turns: int = 10,
    always_keep_system: bool = True,
) -> list:
    """Keep system messages plus the last N user/assistant turn pairs."""
    system_messages = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    # Keep the last max_turns * 2 messages (each turn = user + assistant)
    truncated_conversation = conversation[-(max_turns * 2):]
    if always_keep_system:
        return system_messages + truncated_conversation
    return truncated_conversation
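The summarization-compression strategy can be sketched the same way. The `summarize` callable below is a stand-in for a cheap-model call (for example, GPT-4o-mini prompted to condense the old turns); injecting it as a parameter keeps the sketch testable without an API key:

```python
from typing import Callable

def compress_history(
    messages: list[dict],
    summarize: Callable[[str], str],  # cheap-model call in production
    keep_recent_turns: int = 4,
) -> list[dict]:
    """Replace old turns with a single summary message; keep recent turns verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    convo = [m for m in messages if m["role"] != "system"]
    recent = convo[-(keep_recent_turns * 2):]
    old = convo[:len(convo) - len(recent)]
    if not old:
        return messages  # nothing to compress yet
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(transcript)}",
    }
    return system + [summary] + recent
```

Run this whenever the history crosses a token threshold; each pass folds all older turns into one summary message, so token usage stays bounded.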

Caching is the highest-efficiency optimization available — a cache hit costs nothing and returns instantly. Three distinct caching patterns apply to LLM workloads.

For deterministic, repeated queries — API documentation lookups, FAQ answers, status checks — exact string matching is sufficient. Store the query string as a hash key, the response as the value.

# Requires: redis>=5.0.0
import hashlib
import json

import redis

class ExactMatchCache:
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 3600):
        self.cache = redis_client
        self.ttl = ttl_seconds

    def _make_key(self, model: str, messages: list, temperature: float) -> str:
        payload = json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return f"llm:exact:{hashlib.sha256(payload.encode()).hexdigest()}"

    def get(self, model: str, messages: list, temperature: float = 0.0) -> str | None:
        key = self._make_key(model, messages, temperature)
        cached = self.cache.get(key)
        return cached.decode() if cached else None

    def set(self, model: str, messages: list, response: str, temperature: float = 0.0):
        key = self._make_key(model, messages, temperature)
        self.cache.setex(key, self.ttl, response)

Exact match works best with temperature=0 calls. Caching sampled outputs (temperature > 0) pins every identical request to a single stored response, which defeats the variability you asked for — use it only when the response is meant to be stable.

Semantic caching matches queries by meaning, not exact text. A user asking “What is your return policy?” and another asking “Can I return something I bought?” should get the same cached answer.

The pattern uses embedding similarity:

# Requires: openai>=1.30.0, numpy>=1.26.0
import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92):
        self.entries: list[dict] = []  # In production: use a vector DB
        self.threshold = similarity_threshold

    def _embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a_arr, b_arr = np.array(a), np.array(b)
        return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

    def get(self, query: str) -> str | None:
        query_embedding = self._embed(query)
        for entry in self.entries:
            similarity = self._cosine_similarity(query_embedding, entry["embedding"])
            if similarity >= self.threshold:
                return entry["response"]
        return None

    def set(self, query: str, response: str):
        embedding = self._embed(query)
        self.entries.append({"query": query, "embedding": embedding, "response": response})

In production, replace the in-memory list with a vector database (Pinecone, Weaviate, pgvector) for scalable similarity search. GPTCache is a popular open-source library that implements this pattern with multiple backend options.

Threshold selection matters: A similarity threshold of 0.95+ is conservative (few false positives, lower hit rate). A threshold of 0.85 is aggressive (higher hit rate, occasional incorrect cache hits). Test on your actual query distribution before choosing.
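Threshold selection can be made empirical rather than guessed. Given labeled pairs of (similarity score, whether the cached answer was actually correct for the new query) from offline review, a sweep picks the lowest threshold whose accepted hits stay precise enough. A sketch, assuming you have such labels:

```python
def pick_threshold(
    labeled: list[tuple[float, bool]],  # (similarity, cached answer was correct)
    candidates: list[float],
    min_precision: float = 0.98,
) -> float:
    """Lowest threshold (i.e. highest hit rate) meeting the precision floor."""
    best = max(candidates)  # fall back to the most conservative candidate
    for threshold in sorted(candidates, reverse=True):
        hits = [correct for sim, correct in labeled if sim >= threshold]
        if not hits:
            continue
        precision = sum(hits) / len(hits)
        if precision >= min_precision:
            best = threshold  # keep lowering while precision holds
        else:
            break
    return best
```

Rerun the sweep periodically: the right threshold shifts as your query distribution changes.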

Several providers now offer built-in prompt caching that caches the KV (key-value) computation for repeated prompt prefixes:

  • Anthropic Prompt Caching: Cache breakpoints in system prompts. Cached tokens cost 90% less. Cache lifetime: 5 minutes, extendable.
  • OpenAI Prompt Caching: Automatic for prompts over 1,024 tokens. Cached input tokens cost 50% less.

For OpenAI this is the easiest caching win — the discount applies automatically whenever your prompts share a long common prefix, with zero application code changes. Anthropic requires you to mark the cacheable prefix explicitly:

# Requires: anthropic>=0.40.0
# Anthropic prompt caching — explicit cache breakpoint
import anthropic

anthropic_client = anthropic.Anthropic()

response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",
        },
        {
            "type": "text",
            "text": "<entire_product_catalog_here>",  # Large static context
            "cache_control": {"type": "ephemeral"},   # Cache this prefix
        },
    ],
    messages=[{"role": "user", "content": "What are your shipping options?"}],
)

The cache_control breakpoint tells Anthropic to cache everything up to that point. Subsequent requests sharing the same prefix pay 90% less for those tokens.
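Whichever provider you use, it is worth logging how much of each request actually hit the provider-side cache. The helper below reads the cached-token count from a usage payload shaped like OpenAI's (`prompt_tokens_details.cached_tokens`); field names differ per provider, so treat the shape as an assumption to adapt:

```python
def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens served from the provider-side cache.

    Assumes an OpenAI-style usage payload; adapt field names per provider.
    """
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0

usage = {"prompt_tokens": 4000, "prompt_tokens_details": {"cached_tokens": 3000}}
# 3,000 of 4,000 prompt tokens came from cache -> fraction 0.75
```

A falling cached fraction is an early warning that a prompt edit broke the shared prefix.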


Not all LLM calls are time-sensitive. Background summarization, report generation, content classification, and data enrichment jobs can wait minutes or hours. Batch APIs exist specifically for these workloads — at a 50% price reduction.

| Pattern | Real-time API | Batch API |
| --- | --- | --- |
| User waiting for response | Required | Not suitable |
| Nightly report generation | Wasteful | Ideal |
| Classifying 100K documents | Slow + expensive | Fast + 50% cheaper |
| Email content generation | Depends on deadline | Usually suitable |
| RAG index enrichment | Not needed | Ideal |

# Requires: openai>=1.30.0
import json
import time

from openai import OpenAI

client = OpenAI()

def submit_batch_job(requests: list[dict]) -> str:
    """Submit a batch of LLM requests. Returns batch_id for status polling."""
    # Build the JSONL payload, one request per line
    jsonl_content = "\n".join(
        json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": req["messages"],
                "max_tokens": req.get("max_tokens", 500),
            },
        })
        for i, req in enumerate(requests)
    )
    # Upload the input file
    input_file = client.files.create(
        file=("batch_input.jsonl", jsonl_content.encode(), "application/jsonl"),
        purpose="batch",
    )
    # Create the batch
    batch = client.batches.create(
        input_file_id=input_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return batch.id

def retrieve_batch_results(batch_id: str) -> list[dict]:
    """Poll for completion and return results."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            break
        elif batch.status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"Batch {batch.status}: {batch.errors}")
        time.sleep(60)  # Poll every minute
    output_file = client.files.content(batch.output_file_id)
    results = []
    for line in output_file.text.strip().split("\n"):
        result = json.loads(line)
        results.append({
            "id": result["custom_id"],
            "content": result["response"]["body"]["choices"][0]["message"]["content"],
        })
    return results

Production systems often need both real-time and batch paths. A queue-based architecture handles both:

User Request
├─ is_urgent? → Real-time API (full price)
└─ not urgent → Job Queue → Batch Processor → Result Store
(runs every 4h, 50% cost)

Use a priority queue (Redis, SQS) where urgent jobs bypass the batch accumulator. Non-urgent jobs accumulate until a threshold (time elapsed or batch size) triggers a batch submission.
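A sketch of that accumulator logic: urgent jobs dispatch immediately, everything else buffers until either a size threshold or an age threshold trips. The thresholds and the in-memory structure are illustrative; in production the buffer would live in Redis or SQS, and `realtime_call`/`batch_submit` would wrap the real APIs:

```python
import time

class BatchAccumulator:
    def __init__(self, max_batch_size: int = 100, max_wait_seconds: float = 4 * 3600):
        self.pending: list[dict] = []
        self.oldest_enqueued_at: float | None = None
        self.max_batch_size = max_batch_size
        self.max_wait_seconds = max_wait_seconds

    def add(self, job: dict) -> None:
        if self.oldest_enqueued_at is None:
            self.oldest_enqueued_at = time.monotonic()
        self.pending.append(job)

    def ready(self) -> bool:
        if not self.pending:
            return False
        if len(self.pending) >= self.max_batch_size:
            return True  # size threshold tripped
        return time.monotonic() - self.oldest_enqueued_at >= self.max_wait_seconds

    def drain(self) -> list[dict]:
        batch, self.pending = self.pending, []
        self.oldest_enqueued_at = None
        return batch

def dispatch(job: dict, accumulator: BatchAccumulator, realtime_call, batch_submit) -> None:
    if job.get("urgent"):
        realtime_call(job)       # full price, immediate
        return
    accumulator.add(job)         # 50% price, deferred
    if accumulator.ready():
        batch_submit(accumulator.drain())
```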


Optimization strategies only work if you can measure their effect. Cost monitoring closes the loop — without it, you are flying blind and cannot know whether your optimizations are working or when costs are trending toward overage.

Every LLM call should log token counts and estimated cost:

# Requires: python>=3.10
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LLMCallMetrics:
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float
    latency_ms: float
    user_id: str | None
    feature: str
    timestamp: datetime

# Pricing table, USD per 1K tokens (update when providers change rates)
PRICING = {
    "gpt-4o": {"input": 0.0025, "output": 0.010, "cached_input": 0.00125},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006, "cached_input": 0.000075},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015, "cached_input": 0.0003},
    "claude-3-5-haiku-20241022": {"input": 0.0008, "output": 0.004, "cached_input": 0.00008},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    pricing = PRICING.get(model, PRICING["gpt-4o"])  # default to a conservative estimate
    regular_input = input_tokens - cached_tokens
    cost = (
        regular_input / 1000 * pricing["input"]
        + cached_tokens / 1000 * pricing.get("cached_input", pricing["input"] * 0.1)
        + output_tokens / 1000 * pricing["output"]
    )
    return cost

For multi-tenant applications, set per-user token budgets to prevent runaway usage:

# Requires: redis>=5.0.0
from datetime import date

import redis

class TokenBudgetManager:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_and_consume(
        self,
        user_id: str,
        estimated_tokens: int,
        daily_limit: int = 100_000,
        monthly_limit: int = 2_000_000,
    ) -> bool:
        """Return True if under budget, False if a limit would be exceeded."""
        day_key = f"budget:{user_id}:day:{date.today().isoformat()}"
        month_key = f"budget:{user_id}:month:{date.today().strftime('%Y-%m')}"
        daily_used = int(self.redis.get(day_key) or 0)
        monthly_used = int(self.redis.get(month_key) or 0)
        if daily_used + estimated_tokens > daily_limit:
            return False
        if monthly_used + estimated_tokens > monthly_limit:
            return False
        # Consume the budget (note: this check-then-increment is not atomic;
        # use a Lua script if concurrent requests per user are common)
        pipe = self.redis.pipeline()
        pipe.incrby(day_key, estimated_tokens)
        pipe.expire(day_key, 86400)  # 1-day TTL
        pipe.incrby(month_key, estimated_tokens)
        pipe.expire(month_key, 2_592_000)  # 30-day TTL
        pipe.execute()
        return True

Align your monitoring with LLMOps evaluation and production observability practices:

  • Cost per request — broken down by model, feature, and user segment
  • Cache hit rate — semantic cache hit rate should be a KPI you actively improve
  • Model distribution — percentage of requests routed to each tier
  • Tokens per request — input and output separately; rising averages signal prompt drift
  • Cost per active user per day — the metric that determines unit economics
  • Budget utilization rate — per-user and per-feature, alerting at 80% threshold

A weekly cost review comparing these metrics against a baseline is the practice that catches the regressions that erode optimizations over time — prompt edits that silently lengthen system prompts, new features that default to expensive models, increasing conversation history depths.
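That weekly review can be partially automated: compare the week's metrics against the baseline and flag anything that moved more than a tolerance in the costly direction. A sketch with illustrative metric names:

```python
# For each metric: +1 means "higher is worse", -1 means "lower is worse"
DIRECTION = {
    "cost_per_request": +1,
    "tokens_per_request": +1,
    "cache_hit_rate": -1,
    "tier1_share": -1,
}

def cost_regressions(current: dict, baseline: dict, tolerance: float = 0.10) -> list[str]:
    """Metrics that moved more than `tolerance` (relative) in the bad direction."""
    flagged = []
    for metric, direction in DIRECTION.items():
        if metric not in current or metric not in baseline or not baseline[metric]:
            continue
        relative_change = (current[metric] - baseline[metric]) / baseline[metric]
        if direction * relative_change > tolerance:
            flagged.append(metric)
    return flagged
```

Wire the output into whatever alerting you already use; the point is that a silently lengthened prompt or a dropped cache hit rate shows up within a week, not on the next invoice.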


LLM cost optimization is a required topic for senior and staff GenAI engineering interviews. Interviewers want to see that you understand the economics of production systems, not just how to build demos. These questions appear frequently in GenAI system design rounds.

Q: A customer support chatbot is costing $50,000/month in API fees. The team cannot cut features or reduce users. What is your optimization plan?

A structured answer covers four parallel tracks. First, instrument the system to measure where tokens are going — system prompt size, retrieved context size, conversation history length, output token counts — broken down by request type. Second, implement model routing: classify incoming requests and route simple queries (FAQ lookups, status checks, yes/no decisions) to a Tier 1 model like GPT-4o-mini. Expect 40-60% of requests to qualify. Third, add semantic caching using an embedding similarity search against previous successful responses. FAQ-style questions typically show 30-50% cache hit rates in support contexts. Fourth, compress the system prompt and prune retrieved context to token budgets. Together, these typically reduce spend to $10,000-$20,000/month — a 60-80% reduction with no user-facing quality change.

Q: How do you decide which model to use for a given request in a production system?

The decision framework has three components. A rule-based pre-filter routes obviously simple or obviously complex requests without needing a classifier. A classifier (itself a cheap model call) handles the ambiguous middle. An escape hatch lets users or features explicitly request a higher-tier model when needed. The classifier uses signals like query length, presence of code, reasoning complexity keywords, and whether structured multi-step output is required. The routing decision is logged alongside the response, enabling offline evaluation of routing accuracy — if cheap-model responses are rated poorly by users, the classifier thresholds need adjustment.

Q: What is the difference between exact match caching and semantic caching, and when would you use each?

Exact match caching stores responses keyed by a hash of the full request (model + messages + temperature). It has zero false positives — it only returns a cached response when the request is identical. It works well for deterministic queries: API documentation lookups, fixed-format reports, admin commands. Semantic caching stores an embedding of the query alongside the response, and retrieves cached responses for new queries above a cosine similarity threshold. It handles natural language variation — the same question phrased differently. It requires careful threshold tuning: too low and you return wrong answers, too high and your hit rate is negligible. In practice, exact match handles system-generated or templated queries, semantic caching handles end-user natural language queries.

Q: Your team added a new feature that does five LLM calls per user session. How do you evaluate the cost impact before shipping?

Before shipping, estimate the per-session token cost by running the five calls against a representative sample of user queries and measuring actual token counts. Multiply by expected daily active users and model pricing. Compare against the current per-session cost baseline. If the projected cost increase is above a threshold (typically 20% of current spend), the feature requires optimization before shipping — either by collapsing calls (can any be merged?), routing to cheaper models, adding caching, or deferring non-critical calls to batch processing. This cost gate should be part of the feature’s definition of done, not an afterthought after launch.


Frequently Asked Questions

How can I reduce LLM API costs?

The most effective strategies are: model routing (use cheaper models for simple tasks — saves 40-60%), prompt caching (cache repeated prompt prefixes — saves 70-90% on hits), prompt compression (remove redundant tokens — saves 20-40%), and batch processing (use batch APIs at 50% discount for non-urgent workloads). Combining these strategies can reduce costs by 80-95%.

What is model routing for cost optimization?

Model routing uses a classifier (often a small LLM or rule-based system) to determine the complexity of each request, then routes it to the most cost-effective model. Simple requests go to cheap models like GPT-4o-mini or Claude Haiku. Complex requests go to capable models like GPT-4o or Claude Sonnet. This typically reduces costs by 40-60% while maintaining quality on hard tasks.

What is semantic caching for LLMs?

Semantic caching stores LLM responses and retrieves cached answers for semantically similar (not just identical) queries. It uses embedding similarity to match incoming queries against cached ones. If a new query is similar enough (above a cosine similarity threshold), the cached response is returned instantly — zero API cost and sub-millisecond latency. Learn more in our LLM caching guide.

How much does it cost to run an LLM in production?

Costs vary enormously by model, volume, and optimization. A typical production app handling 10,000 requests/day with GPT-4o costs roughly $500-$1,500/month unoptimized. With model routing (60% to mini), caching (30% hit rate), and batch processing for async jobs, the same app costs $100-$300/month. LLM costs are proportional to tokens processed, and most optimizations reduce token count or model cost per token.

How does prompt compression reduce LLM costs?

Prompt compression strips unnecessary tokens from inputs without degrading output quality. LLMs respond equally well to compressed instruction formats — a 127-token system prompt can often be reduced to 42 tokens (67% fewer) with the same instructions. At scale across 10,000 daily requests, this saves roughly $775/year at GPT-4o pricing on a single system prompt alone.

When should I use batch APIs instead of real-time LLM calls?

Use batch APIs for workloads that are not time-sensitive: nightly report generation, classifying large document sets, email content generation, and RAG index enrichment. Both OpenAI and Anthropic offer batch APIs at a 50% price reduction with a 24-hour completion window. Reserve real-time API calls for user-facing requests where someone is waiting for a response.

What metrics should I track for LLM cost optimization?

Track six key metrics: cost per request broken down by model and feature, semantic cache hit rate as an actively improved KPI, model distribution showing percentage of requests at each tier, tokens per request (input and output separately), cost per active user per day for unit economics, and budget utilization rate per user and feature with alerts at 80% threshold.

How do I set per-user token budgets for LLM applications?

Implement per-user token budgets using a Redis-backed budget manager that tracks daily and monthly token consumption. Before each LLM call, check the estimated token count against the user's remaining budget. If the request would exceed the daily or monthly limit, reject it gracefully. This prevents runaway usage in multi-tenant applications.

What is the difference between exact match caching and semantic caching?

Exact match caching stores responses keyed by a hash of the full request — zero false positives, only returns cached responses for identical requests. It works for deterministic, templated queries. Semantic caching stores query embeddings and retrieves responses for semantically similar queries above a cosine similarity threshold, handling natural language variation. Use exact match for system-generated queries and semantic caching for end-user queries.

Why do LLM API costs spiral in production?

LLM costs compound across four factors: which model you call (GPT-4o costs roughly 30-50x more per token than GPT-4o-mini), how many input tokens you send (system prompt, retrieved context, conversation history), how many output tokens the model generates, and how often you call the API (retries, chain steps, agent loops). A RAG system doing three LLM calls per interaction can reach over $20,000/month at 10,000 requests per day.