LLM Routing — Smart Model Selection for Cost and Quality (2026)
Most production LLM applications send every request to the same model. A customer support bot routes “What are your hours?” to GPT-4o. A code assistant sends a formatting request to Claude Opus. In both cases the cost per request is 10-30x higher than it needs to be. LLM routing fixes this by directing each request to the cheapest model that meets the quality bar — one of the three core levers of LLM cost optimization, alongside caching and prompt compression.
Who this is for:
- GenAI engineers who need to reduce API spend without degrading user experience
- Backend engineers integrating multiple LLM providers into a unified routing strategy
- Senior engineers preparing for system design interviews — model routing comes up regularly in GenAI architecture rounds
- Engineering managers evaluating multi-model strategies for cost control at scale
1. Why LLM Routing Matters
Using a single expensive model for every request is the GenAI equivalent of running every SQL query against a fully provisioned data warehouse when most queries could hit a read replica.
The cost gap between model tiers is enormous. GPT-4o costs roughly $2.50 per million input tokens. GPT-4o-mini costs $0.15 — about 17x cheaper. Claude Opus 4 costs $15 per million input tokens. Claude Haiku 3.5 costs $0.80 — about 19x cheaper. For tasks where both models produce equivalent output, every request to the expensive model is wasted spend.
Most production requests are simple. In real-world LLM applications, 50-70% of requests are simple enough for cheap models: text classification, entity extraction, formatting, simple Q&A from retrieved context, and template filling. Only 20-30% genuinely require frontier-model reasoning.
The quality difference on simple tasks is negligible. On classification, extraction, and simple generation benchmarks, the gap between Tier 1 (mini/haiku) and Tier 3 (full/opus) models is under 2%.
Routing pays for itself immediately. If 60% of your traffic routes to a model that costs 15x less, your blended cost per request drops by roughly 50%. For a system processing 10,000 requests per day at GPT-4o pricing, that is a reduction from roughly $1,500/month to $750/month on routing alone.
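The arithmetic behind that estimate is easy to sanity-check. A minimal sketch, using the 60% / 15x figures from this section and an assumed $0.005-per-request GPT-4o baseline ($1,500 / 300,000 requests per month):

```python
def blended_cost(expensive_cost: float, cheap_ratio: float, cheap_fraction: float) -> float:
    """Average cost per request when `cheap_fraction` of traffic goes to a
    model that is `cheap_ratio` times cheaper than the expensive model."""
    cheap_cost = expensive_cost / cheap_ratio
    return cheap_fraction * cheap_cost + (1 - cheap_fraction) * expensive_cost


baseline = 0.005  # assumed cost per request at GPT-4o pricing, in dollars
routed = blended_cost(baseline, cheap_ratio=15, cheap_fraction=0.6)
savings = 1 - routed / baseline
print(f"blended cost: ${routed:.4f}/request, savings: {savings:.0%}")
```

With these inputs the blended cost is $0.0022 per request, a 56% reduction — consistent with the "roughly 50%" figure above.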
2. When to Implement LLM Routing
Not every application needs a routing layer. Here is the decision framework.
Decision Table
| Signal | No Routing Needed | Routing Recommended |
|---|---|---|
| Daily request volume | <500 requests | >1,000 requests |
| Monthly LLM spend | <$100 | >$500 |
| Request diversity | All requests same type | Mix of simple and complex |
| Latency sensitivity | Uniform latency OK | Some requests need <500ms |
| Model providers | Single provider | Multiple providers |
Three triggers for adoption: (1) Monthly API bill exceeds $500 with a mix of simple and complex requests. (2) Latency requirements differ by feature — autocomplete needs sub-second, analysis can tolerate 2-3s. (3) Multi-provider resilience — fail over between OpenAI, Anthropic, and open-source models.
When to skip: If all requests are the same type and complexity. Routing adds a classifier to maintain and a new failure mode (misclassification).
3. How LLM Routing Works — Architecture
The routing system sits between your application and the model providers. Every request passes through classification, model selection, execution, and validation.
LLM Routing Pipeline
Each request is classified by complexity, routed to the cheapest capable model, and validated before returning to the caller.
How the Pipeline Works
- Request intake — the application sends the query with metadata (feature, user tier, quality preferences).
- Complexity classification — rule-based (fast, free) or ML-based (~50ms, more accurate for ambiguous cases).
- Model selection — maps the complexity tier to a model, factoring in cost budget, latency, and provider availability.
- LLM execution — calls the selected model with fallback: retry once on error, then escalate to the next tier.
- Response validation — for cascading routers, checks quality and escalates automatically if below threshold.
4. LLM Routing Tutorial — Build a Router in Python
Build a production-ready router that classifies query complexity, selects the optimal model, and implements fallback chains.
Step 1: Define the Model Registry
```python
from dataclasses import dataclass
from enum import Enum


class ComplexityTier(Enum):
    SIMPLE = "simple"        # Classification, extraction, formatting
    MODERATE = "moderate"    # Summarization, simple Q&A, template filling
    COMPLEX = "complex"      # Multi-step reasoning, code generation, synthesis


@dataclass
class ModelConfig:
    model_id: str
    provider: str
    cost_per_1k_input: float
    cost_per_1k_output: float
    max_tokens: int
    avg_latency_ms: int


MODEL_REGISTRY: dict[ComplexityTier, list[ModelConfig]] = {
    ComplexityTier.SIMPLE: [
        ModelConfig("gpt-4o-mini", "openai", 0.00015, 0.0006, 16384, 300),
        ModelConfig("claude-3-5-haiku-20241022", "anthropic", 0.0008, 0.004, 8192, 400),
    ],
    ComplexityTier.MODERATE: [
        ModelConfig("gpt-4o", "openai", 0.0025, 0.01, 16384, 800),
    ],
    ComplexityTier.COMPLEX: [
        ModelConfig("claude-opus-4-20250514", "anthropic", 0.015, 0.075, 4096, 2000),
    ],
}
```
Step 2: Build the Complexity Classifier
```python
import re


def classify_complexity(query: str, metadata: dict | None = None) -> ComplexityTier:
    """Rule-based complexity classifier."""
    metadata = metadata or {}
    feature = metadata.get("feature", "")

    # Feature-level overrides
    if feature in ("classification", "extraction", "formatting"):
        return ComplexityTier.SIMPLE
    if feature in ("code_generation", "research", "multi_step_analysis"):
        return ComplexityTier.COMPLEX

    query_lower = query.lower()
    word_count = len(query.split())

    if word_count < 20 and not any(kw in query_lower for kw in [
        "explain", "compare", "analyze", "write code", "debug", "design"
    ]):
        return ComplexityTier.SIMPLE

    if re.search(r"```|def |class |function |import ", query):
        return ComplexityTier.COMPLEX

    complex_signals = ["step by step", "compare and contrast", "architecture",
                       "design a system", "write a complete", "refactor"]
    if any(signal in query_lower for signal in complex_signals):
        return ComplexityTier.COMPLEX

    return ComplexityTier.MODERATE
```
Step 3: Implement the Router with Fallback
```python
import logging

logger = logging.getLogger(__name__)


class LLMRouter:
    def __init__(self, registry: dict[ComplexityTier, list[ModelConfig]]):
        self.registry = registry
        self.clients = {}  # provider -> client mapping

    def register_client(self, provider: str, client) -> None:
        self.clients[provider] = client

    async def route(self, query: str, metadata: dict | None = None,
                    tier: ComplexityTier | None = None) -> dict:
        """Route a query to the optimal model with fallback."""
        if tier is None:
            tier = classify_complexity(query, metadata)

        for model in self.registry.get(tier, []):
            client = self.clients.get(model.provider)
            if not client:
                continue
            try:
                response = await client.generate(
                    model=model.model_id,
                    messages=[{"role": "user", "content": query}],
                )
                return {
                    "content": response.content,
                    "model": model.model_id,
                    "tier": tier.value,
                    "provider": model.provider,
                }
            except Exception as e:
                logger.warning(f"{model.model_id} failed: {e}")

        # Escalate to the next tier on failure. Passing the escalated tier
        # explicitly avoids re-classifying into the same tier and recursing forever.
        next_tier = _escalate_tier(tier)
        if next_tier:
            return await self.route(query, metadata, tier=next_tier)
        raise RuntimeError(f"All models failed for tier {tier.value}")


def _escalate_tier(tier: ComplexityTier) -> ComplexityTier | None:
    return {
        ComplexityTier.SIMPLE: ComplexityTier.MODERATE,
        ComplexityTier.MODERATE: ComplexityTier.COMPLEX,
    }.get(tier)
```
Step 4: Wire It Up
```python
router = LLMRouter(MODEL_REGISTRY)
router.register_client("openai", openai_client)
router.register_client("anthropic", anthropic_client)

# Simple query → GPT-4o-mini or Haiku
result = await router.route("What is the capital of France?")

# Complex query → GPT-4o or Opus
result = await router.route(
    "Design a distributed caching system for an LLM gateway",
    metadata={"feature": "code_generation"},
)
```
5. Routing Architecture Patterns
Production routing systems share a layered architecture. Each layer has a clear responsibility.
LLM Routing Gateway Architecture
Six layers from application request to cost tracking — each layer is independently testable and replaceable
Layer Responsibilities
Application Layer. Attaches routing metadata to each request: feature, user tier (free vs. paid), and quality preferences.
Gateway. Rate limiting, authentication, and request queuing. If you use Portkey or LiteLLM, this layer is built in.
Complexity Classifier. Takes query text and metadata, outputs a complexity tier. In simple deployments, this is 50 lines of rule-based Python. In sophisticated deployments, a small ML model trained on historical routing outcomes.
Model Registry. Maps complexity tiers to model lists with pricing, latency, and availability. Should be hot-reloadable — update without redeploying when providers change pricing.
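One way to make the registry hot-reloadable is to key reloads off the config file's modification time. A sketch, assuming a JSON file of tier-to-model mappings — the schema and class name here are illustrative, not a standard format:

```python
import json
from pathlib import Path


class HotReloadRegistry:
    """Reload the tier -> models mapping whenever the config file changes on
    disk, so pricing or model updates take effect without a redeploy.

    Assumes a JSON file shaped like {"simple": [...], "moderate": [...], ...}.
    """

    def __init__(self, path: str):
        self.path = Path(path)
        self._mtime = 0.0
        self._registry: dict = {}

    def get(self) -> dict:
        mtime = self.path.stat().st_mtime
        if mtime != self._mtime:  # file changed, or first load
            self._registry = json.loads(self.path.read_text())
            self._mtime = mtime
        return self._registry
```

Callers fetch the mapping through `get()` on every request; the `stat()` call is cheap compared to any model call.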
Provider Adapters. Unified interface over OpenAI, Anthropic, and self-hosted APIs. Each adapter handles auth, formatting, parsing, and retries.
Cost Tracker. Logs model, token counts, latency, and cost per request. Feeds dashboards, budget alerts, and the classifier feedback loop.
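The cost tracker's core computation is per-request cost from token counts and registry pricing. A minimal sketch — the record fields mirror the per-1k pricing convention used in the registry above, and the class name is an assumption:

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """One logged LLM call: what ran, how many tokens, at what price."""
    model_id: str
    input_tokens: int
    output_tokens: int
    cost_per_1k_input: float
    cost_per_1k_output: float

    @property
    def cost(self) -> float:
        # Prices are quoted per 1,000 tokens, so scale token counts down.
        return (self.input_tokens / 1000 * self.cost_per_1k_input
                + self.output_tokens / 1000 * self.cost_per_1k_output)
```

Aggregating these records by tier, feature, and day is what feeds the dashboards and budget alerts.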
6. LLM Routing Code Examples
Three patterns cover most routing needs: rule-based for simplicity, ML-based for accuracy, and cascading for quality guarantees.
Example 1: Rule-Based Router
Pure logic based on request metadata and query characteristics. Zero latency, zero cost overhead.
```python
class RuleBasedRouter:
    """Route based on deterministic rules. Zero classifier overhead."""

    SIMPLE_FEATURES = {"faq", "status_check", "classification", "extraction"}
    COMPLEX_FEATURES = {"code_review", "research_synthesis", "creative_writing"}

    def __init__(self, simple_model: str, moderate_model: str, complex_model: str):
        self.models = {
            ComplexityTier.SIMPLE: simple_model,
            ComplexityTier.MODERATE: moderate_model,
            ComplexityTier.COMPLEX: complex_model,
        }

    def select_model(self, query: str, feature: str) -> str:
        if feature in self.SIMPLE_FEATURES:
            return self.models[ComplexityTier.SIMPLE]
        if feature in self.COMPLEX_FEATURES:
            return self.models[ComplexityTier.COMPLEX]
        word_count = len(query.split())
        if word_count < 30:
            return self.models[ComplexityTier.SIMPLE]
        if word_count > 200:
            return self.models[ComplexityTier.COMPLEX]
        return self.models[ComplexityTier.MODERATE]


router = RuleBasedRouter("gpt-4o-mini", "gpt-4o", "gpt-4o")
model = router.select_model("What are your hours?", feature="faq")
# Returns "gpt-4o-mini"
```
When to use: Early-stage applications and predictable request patterns. Covers 80% of the value with 20% of the complexity.
Example 2: ML-Based Classifier Router
Uses a small LLM to classify complexity before routing. More accurate than rules for ambiguous requests. Adds ~50-100ms latency and one cheap model call ($0.00001 per classification).
```python
import json


class MLClassifierRouter:
    """Use a small LLM to classify query complexity."""

    CLASSIFIER_PROMPT = """Classify this query's complexity for LLM routing.
- SIMPLE: factual lookup, classification, extraction, formatting
- MODERATE: summarization, simple Q&A, template generation
- COMPLEX: multi-step reasoning, code generation, system design

Query: {query}
Return JSON: {{"tier": "SIMPLE"|"MODERATE"|"COMPLEX"}}"""

    def __init__(self, classifier_client, model: str = "gpt-4o-mini"):
        self.client = classifier_client
        self.model = model

    async def classify(self, query: str) -> ComplexityTier:
        response = await self.client.generate(
            model=self.model,
            messages=[{"role": "user",
                       "content": self.CLASSIFIER_PROMPT.format(query=query)}],
            max_tokens=50,
            temperature=0,
        )
        try:
            tier_str = json.loads(response.content).get("tier", "MODERATE")
            return ComplexityTier(tier_str.lower())
        except (json.JSONDecodeError, ValueError):
            # Unparseable classifier output: fall back to the middle tier.
            return ComplexityTier.MODERATE
```
When to use: Applications with diverse query types where rule-based classification misroutes more than 15% of requests.
Example 3: Cascading Router
Send every request to the cheapest model first. If response quality is insufficient, escalate to a more capable model.
```python
class CascadingRouter:
    """Try cheap model first, escalate if confidence is low."""

    def __init__(self, tiers: list[ModelConfig], confidence_threshold: float = 0.7):
        self.tiers = tiers  # Ordered cheapest to most expensive
        self.confidence_threshold = confidence_threshold

    async def route(self, query: str, client_map: dict) -> dict:
        for i, model in enumerate(self.tiers):
            client = client_map.get(model.provider)
            if not client:
                continue

            response = await client.generate(
                model=model.model_id,
                messages=[{"role": "user", "content": query}],
            )

            # Last tier — return unconditionally
            if i == len(self.tiers) - 1:
                return {"content": response.content,
                        "model": model.model_id,
                        "escalated": i > 0}

            # Confidence passes — stop escalating
            confidence = self._evaluate_confidence(response.content)
            if confidence >= self.confidence_threshold:
                return {"content": response.content,
                        "model": model.model_id,
                        "escalated": i > 0}

        raise RuntimeError("All tiers exhausted")

    def _evaluate_confidence(self, content: str) -> float:
        """Heuristic quality check — length, hedging, structure."""
        score = 0.5
        if len(content.split()) > 20:
            score += 0.15
        hedging = ["i'm not sure", "i don't know", "i cannot"]
        if not any(h in content.lower() for h in hedging):
            score += 0.15
        if any(m in content for m in ["```", "1.", "- "]):
            score += 0.1
        if content.rstrip().endswith((".", "```", ")")):
            score += 0.1
        return min(score, 1.0)
```
When to use: Quality-critical applications (medical, legal, financial). Tradeoff: higher latency on escalated requests (two model calls).
7. Static vs Dynamic Routing
The two strategies differ in when the model decision is made — at configuration time or at request time.
Static vs Dynamic LLM Routing
Static routing:
- Model assigned by endpoint, feature flag, or request type
- Zero classifier latency — no computation per request
- Easy to debug — deterministic model selection
- Configuration changes require deployment
- Cannot adapt to query-level complexity differences
- Misses savings on simple queries within complex features
- Best for applications with clearly separated use cases

Dynamic routing:
- Each request evaluated individually for complexity
- Achieves 40-70% cost savings on mixed workloads
- Adapts in real time to query characteristics
- Classifier adds 50-100ms latency per request
- Misclassification risk — wrong model for some queries
- Requires logging and evaluation infrastructure
- Best for diverse workloads with mixed complexity
Hybrid Approach
Most mature systems combine both. Features with predictable complexity get static routing — your classification endpoint always uses GPT-4o-mini, your code review endpoint always uses GPT-4o. General-purpose endpoints use dynamic classification because query complexity varies per request. Start static. Add dynamic routing only where the data shows mixed complexity patterns.
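A hybrid selector can be as small as a static feature map with a dynamic fallback. A sketch, where the feature names and the word-count heuristic standing in for the classifier are illustrative:

```python
STATIC_ROUTES = {  # features with predictable complexity
    "classification": "gpt-4o-mini",
    "code_review": "gpt-4o",
}


def classify_dynamic(query: str) -> str:
    """Stand-in for a per-request classifier; any heuristic or ML model works."""
    return "gpt-4o" if len(query.split()) > 50 else "gpt-4o-mini"


def select_model(query: str, feature: str) -> str:
    """Static route when the feature's complexity is known, dynamic otherwise."""
    if feature in STATIC_ROUTES:
        return STATIC_ROUTES[feature]
    return classify_dynamic(query)
```

The static map handles the predictable endpoints at zero overhead; only general-purpose traffic pays the classifier cost.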
8. LLM Routing Interview Questions
Model routing appears frequently in GenAI system design interviews — it tests production cost management, not just API usage.
Q: Your company’s LLM costs are $30,000/month. The CEO asks you to cut costs by 50% without reducing quality. What is your plan?
A: Start with instrumentation — log model, token counts, and quality scores for every request. Analyze the distribution: in most systems, 50-70% of requests are simple. Implement three-tier routing: simple requests to GPT-4o-mini (17x cheaper), moderate to GPT-4o, complex to the frontier model. If 60% routes to Tier 1, blended cost drops by roughly 50%. Add semantic caching (30-50% hit rate in support contexts) and prompt compression. Together, these achieve 60-80% cost reduction.
Q: How do you ensure the routing classifier does not degrade output quality?
A: Two mechanisms. First, offline evaluation: run a golden dataset through the classifier and measure misclassification rate — if more than 10% of complex queries get routed to Tier 1, the classifier needs tuning. Second, online monitoring: log routing decisions alongside quality scores from your evaluation pipeline. Track quality by tier, set alerts on degradation, and review weekly.
Q: When would you use a cascading router vs. a classifier-based router?
A: Cascading routers suit quality-critical applications (medical, legal, financial) where the cheap model tries first and escalates if confidence is low. Classifier-based routers suit latency-sensitive applications (chatbots, customer support) where a single model call per request is required and double-calling is too slow.
Q: A teammate proposes always using the cheapest model and only escalating when users complain. What is wrong with this approach?
A: Three problems. First, users receive degraded responses until they complain — and most do not complain, they leave. Second, no signal about silent quality degradation. Third, a reactive loop instead of a proactive one. The correct approach: measure quality per tier before shipping, set escalation thresholds based on automated evaluation, and monitor continuously. User complaints are a lagging indicator.
9. LLM Routing in Production
Production routing requires monitoring, A/B testing, and integration with the broader LLMOps stack.
Cost Savings Metrics
- Routing distribution — percentage per tier. Target: 50-70% at Tier 1 for mixed workloads.
- Cost per request — by tier, feature, and user segment vs. the “no routing” baseline.
- Quality by tier — average quality score per tier. Tier 1 must stay above your minimum threshold.
- Misclassification rate — percentage of requests where the routed model produced below-threshold output.
- Routing overhead — rule-based: <1ms. ML-based: 50-100ms. Must stay within your latency budget.
A/B Testing Model Changes
Before changing routing configuration, run an A/B test: control group on existing config, treatment on new config. Measure quality score delta, cost delta, and latency delta across 1,000-5,000 requests per group over at least 7 days. Use consistent hashing on user ID to keep each user on the same config during the test.
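Stable group assignment can be done with a hash of the user ID, so each user stays in one group for the duration of the test regardless of which process handles the request. A sketch (the salt and split are illustrative):

```python
import hashlib


def assign_group(user_id: str, salt: str = "routing-ab-2026",
                 treatment_pct: int = 50) -> str:
    """Deterministically map a user to control or treatment. The same
    user_id always lands in the same group, across processes and restarts."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"
```

Changing the salt per experiment reshuffles users, so carryover effects from a previous test do not bias the next one.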
Production Tools
LiteLLM — Open-source proxy providing a unified API across 100+ LLM providers with routing, fallback, and spend tracking. Self-hosted or cloud.
Portkey — AI gateway with routing, automatic retries, caching, and observability dashboards. Cloud-hosted with a generous free tier.
Martian Model Router — ML-based service that selects the optimal model per request, learning from historical quality data.
Custom routing — Build with the OpenAI and Anthropic SDKs, a classifier function, and a model registry config. Most teams start here and migrate to a managed gateway at 50,000+ requests per day.
Monitoring Checklist
- Cost dashboard by model, tier, feature, and day
- Quality scores per tier with degradation alerts
- Routing distribution chart (sudden shifts = classifier drift)
- Fallback rate tracking (frequent fallbacks = provider issues)
- Latency percentiles (p50, p95, p99) per tier
10. Summary and Key Takeaways
LLM routing is the highest-leverage cost optimization for production GenAI systems. In most applications, 50-70% of queries are simple enough for models that cost 10-30x less.
Start with rule-based routing. Map features to complexity tiers. Route known-simple features to cheap models. This alone cuts costs by 30-50%.
Add dynamic classification for diverse workloads. A classifier that evaluates each request individually captures savings that static routing misses.
Use cascading routers for quality-critical applications. Try the cheap model first, escalate if confidence is low. This guarantees quality parity while maximizing savings.
Measure everything. Log routing decisions, track quality by tier, and review misclassification rates weekly.
Combine routing with caching and prompt compression. Routing selects the cheapest model. Caching eliminates calls on hits. Prompt compression reduces tokens per call. Together, these achieve 70-90% cost reduction.
Related
- LLM Cost Optimization — Complete cost reduction playbook
- LLMOps — CI/CD and deployment patterns for production LLM systems
- LLM Evaluation — Measuring quality per model tier
- Cloud AI Platforms — Provider comparison for multi-provider routing
- LLM Caching — Semantic, KV, and prompt caching strategies
Frequently Asked Questions
What is LLM routing?
LLM routing directs each API request to the most cost-effective model that can handle it at the required quality level. A classifier evaluates query complexity — simple tasks go to cheap models (GPT-4o-mini, Claude Haiku), while complex reasoning goes to capable models (GPT-4o, Claude Opus). This typically reduces API costs by 40-70% without degrading output quality. See our LLM cost optimization guide for the complete cost reduction playbook.
How does LLM routing reduce costs?
LLM routing prevents expensive model calls for tasks that cheaper models handle equally well. In a typical production workload, 50-70% of requests are simple enough for a Tier 1 model that costs 10-30x less per token. By routing only complex requests to expensive models, total API spend drops by 40-70%.
What is the difference between static and dynamic LLM routing?
Static routing assigns models based on fixed rules — endpoint, feature flag, or request type. Dynamic routing uses a classifier to evaluate each request at runtime and select the best model based on complexity. Static is predictable and zero-overhead; dynamic achieves higher savings but adds latency. Most production systems use a hybrid of both.
How do I classify query complexity for model routing?
Start with rule-based signals: query length, presence of code blocks, request type from your application, and whether structured output is required. For more accuracy, use a small LLM (GPT-4o-mini) as the classifier — it evaluates the query and returns a complexity tier in under 100ms at minimal cost. Log routing decisions alongside quality scores to continuously improve accuracy.
What is a cascading router for LLMs?
A cascading router sends every request to the cheapest model first, then evaluates the response confidence. If quality meets the threshold, the cheap response is returned. If confidence is low, the request escalates to the next tier. This guarantees quality parity with the expensive model while achieving cost savings proportional to the cheap model's success rate.
What tools support LLM routing in production?
LiteLLM provides a unified API across 100+ providers with built-in routing. Portkey offers an AI gateway with automatic retries and model fallback. Martian uses ML to select the optimal model per request. For custom routing, most teams build a thin layer using the OpenAI and Anthropic SDKs with a classifier function.
How do I handle fallback when a routed model fails?
Implement a fallback chain: if the selected model returns an error, retry once, then escalate to the next tier. For rate limits, maintain a token bucket per provider and pre-route away from providers approaching their limit. Log every fallback event — frequent fallbacks indicate capacity issues that need architectural attention.
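The per-provider token bucket mentioned above can be sketched in a few lines (rates, capacities, and the class name are illustrative):

```python
import time


class TokenBucket:
    """Per-provider request budget. Pre-route away from a provider whose
    bucket is nearly empty instead of waiting to hit its rate limit."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, n: float = 1.0) -> bool:
        """Spend n tokens if available; False means route elsewhere."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def nearly_empty(self, threshold: float = 0.2) -> bool:
        """True when the provider is approaching its limit."""
        self._refill()
        return self.tokens < self.capacity * threshold
```

The router checks `nearly_empty()` before selecting a provider and falls back to the next one in the chain when it returns True.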
What metrics should I track for an LLM routing system?
Track five metrics: routing distribution (percentage per tier — aim for 50-70% at Tier 1), cost per request by tier, quality scores by tier, misclassification rate (requests routed to cheap models that produced low-quality output), and routing latency overhead (<1ms rule-based, 50-100ms ML-based — keep it within your latency budget). Review weekly to tune classifier thresholds.
Can I use LLM routing with RAG systems?
Yes — use retrieval results as a routing signal. If retrieved chunks have high relevance scores and the query is a direct factual lookup, route to a cheap model. If chunks have low relevance or the query requires synthesis, route to a capable model. This routes 40-60% of RAG queries to Tier 1 models.
How does LLM routing differ from load balancing?
Load balancing distributes requests across identical instances of the same model for throughput. LLM routing selects which model to call based on request characteristics. In production, you need both: routing selects the model, then load balancing distributes across that model's instances.