LLM Routing — Smart Model Selection for Cost and Quality (2026)
Most production LLM applications send every request to the same model. A customer support bot routes “What are your hours?” to GPT-4o. A code assistant sends a formatting request to Claude Opus. In both cases the cost per request is 10-30x higher than it needs to be. LLM routing fixes this by directing each request to the cheapest model that meets the quality bar — one of the three core levers of LLM cost optimization, alongside caching and prompt compression.
Who this is for:
- GenAI engineers who need to reduce API spend without degrading user experience
- Backend engineers integrating multiple LLM providers into a unified routing strategy
- Senior engineers preparing for system design interviews — model routing comes up regularly in GenAI architecture rounds
- Engineering managers evaluating multi-model strategies for cost control at scale
1. Why LLM Routing Matters
Using a single expensive model for every request is the GenAI equivalent of running every SQL query against a fully provisioned data warehouse when most queries could hit a read replica.
The cost gap between model tiers is enormous. GPT-4o costs roughly $2.50 per million input tokens. GPT-4o-mini costs $0.15 — about 17x cheaper. Claude Opus 4 costs $15 per million input tokens. Claude Haiku 3.5 costs $0.80 — about 19x cheaper. For tasks where both models produce equivalent output, every request to the expensive model is wasted spend.
Most production requests are simple. In real-world LLM applications, 50-70% of requests are simple enough for cheap models: text classification, entity extraction, formatting, simple Q&A from retrieved context, and template filling. Only 20-30% genuinely require frontier-model reasoning.
The quality difference on simple tasks is negligible. On classification, extraction, and simple generation benchmarks, the gap between Tier 1 (mini/haiku) and Tier 3 (full/opus) models is under 2%.
Routing pays for itself immediately. If 60% of your traffic routes to a model that costs 15x less, your blended cost per request drops by roughly 50%. For a system processing 10,000 requests per day at GPT-4o pricing, that is a reduction from roughly $1,500/month to $750/month on routing alone.
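The arithmetic behind that estimate is easy to sanity-check. A minimal sketch, using the 60% / 15x figures from this section and an assumed $0.005-per-request GPT-4o baseline ($1,500 / 300,000 requests per month):

```python
def blended_cost(expensive_cost: float, cheap_ratio: float, cheap_fraction: float) -> float:
    """Average cost per request when `cheap_fraction` of traffic goes to a
    model that is `cheap_ratio` times cheaper than the expensive model."""
    cheap_cost = expensive_cost / cheap_ratio
    return cheap_fraction * cheap_cost + (1 - cheap_fraction) * expensive_cost


baseline = 0.005  # assumed cost per request at GPT-4o pricing, in dollars
routed = blended_cost(baseline, cheap_ratio=15, cheap_fraction=0.6)
savings = 1 - routed / baseline
print(f"blended cost: ${routed:.4f}/request, savings: {savings:.0%}")
```

With these inputs the blended cost is $0.0022 per request, a 56% reduction — consistent with the "roughly 50%" figure above.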
2. When to Implement LLM Routing
Not every application needs a routing layer. Here is the decision framework.
Decision Table
| Signal | No Routing Needed | Routing Recommended |
|---|---|---|
| Daily request volume | <500 requests | >1,000 requests |
| Monthly LLM spend | <$100 | >$500 |
| Request diversity | All requests same type | Mix of simple and complex |
| Latency sensitivity | Uniform latency OK | Some requests need <500ms |
| Model providers | Single provider | Multiple providers |
Three triggers for adoption: (1) Monthly API bill exceeds $500 with a mix of simple and complex requests. (2) Latency requirements differ by feature — autocomplete needs sub-second, analysis can tolerate 2-3s. (3) Multi-provider resilience — fail over between OpenAI, Anthropic, and open-source models.
When to skip: If all requests are the same type and complexity. Routing adds a classifier to maintain and a new failure mode (misclassification).
3. How LLM Routing Works — Architecture
The routing system sits between your application and the model providers. Every request passes through classification, model selection, execution, and validation.
LLM Routing Pipeline
Each request is classified by complexity, routed to the cheapest capable model, and validated before returning to the caller.
How the Pipeline Works
- Request intake — the application sends the query with metadata (feature, user tier, quality preferences).
- Complexity classification — rule-based (fast, free) or ML-based (~50ms, more accurate for ambiguous cases).
- Model selection — maps the complexity tier to a model, factoring in cost budget, latency, and provider availability.
- LLM execution — calls the selected model with fallback: retry once on error, then escalate to the next tier.
- Response validation — for cascading routers, checks quality and escalates automatically if below threshold.
4. LLM Routing Tutorial — Build a Router in Python
Build a production-ready router that classifies query complexity, selects the optimal model, and implements fallback chains.
Step 1: Define the Model Registry
```python
from dataclasses import dataclass
from enum import Enum


class ComplexityTier(Enum):
    SIMPLE = "simple"        # Classification, extraction, formatting
    MODERATE = "moderate"    # Summarization, simple Q&A, template filling
    COMPLEX = "complex"      # Multi-step reasoning, code generation, synthesis


@dataclass
class ModelConfig:
    model_id: str
    provider: str
    cost_per_1k_input: float
    cost_per_1k_output: float
    max_tokens: int
    avg_latency_ms: int


MODEL_REGISTRY: dict[ComplexityTier, list[ModelConfig]] = {
    ComplexityTier.SIMPLE: [
        ModelConfig("gpt-4o-mini", "openai", 0.00015, 0.0006, 16384, 300),
        ModelConfig("claude-3-5-haiku-20241022", "anthropic", 0.0008, 0.004, 8192, 400),
    ],
    ComplexityTier.MODERATE: [
        ModelConfig("gpt-4o", "openai", 0.0025, 0.01, 16384, 800),
    ],
    ComplexityTier.COMPLEX: [
        ModelConfig("claude-opus-4-20250514", "anthropic", 0.015, 0.075, 4096, 2000),
    ],
}
```
Step 2: Build the Complexity Classifier
```python
import re


def classify_complexity(query: str, metadata: dict | None = None) -> ComplexityTier:
    """Rule-based complexity classifier."""
    metadata = metadata or {}
    feature = metadata.get("feature", "")

    # Feature-level overrides
    if feature in ("classification", "extraction", "formatting"):
        return ComplexityTier.SIMPLE
    if feature in ("code_generation", "research", "multi_step_analysis"):
        return ComplexityTier.COMPLEX

    query_lower = query.lower()
    word_count = len(query.split())

    if word_count < 20 and not any(kw in query_lower for kw in [
        "explain", "compare", "analyze", "write code", "debug", "design"
    ]):
        return ComplexityTier.SIMPLE

    if re.search(r"```|def |class |function |import ", query):
        return ComplexityTier.COMPLEX

    complex_signals = ["step by step", "compare and contrast", "architecture",
                       "design a system", "write a complete", "refactor"]
    if any(signal in query_lower for signal in complex_signals):
        return ComplexityTier.COMPLEX

    return ComplexityTier.MODERATE
```
Step 3: Implement the Router with Fallback
```python
import logging

logger = logging.getLogger(__name__)


class LLMRouter:
    def __init__(self, registry: dict[ComplexityTier, list[ModelConfig]]):
        self.registry = registry
        self.clients = {}  # provider -> client mapping

    def register_client(self, provider: str, client) -> None:
        self.clients[provider] = client

    async def route(self, query: str, metadata: dict | None = None,
                    tier: ComplexityTier | None = None) -> dict:
        """Route a query to the optimal model with fallback."""
        if tier is None:
            tier = classify_complexity(query, metadata)

        for model in self.registry.get(tier, []):
            client = self.clients.get(model.provider)
            if not client:
                continue
            try:
                response = await client.generate(
                    model=model.model_id,
                    messages=[{"role": "user", "content": query}],
                )
                return {
                    "content": response.content,
                    "model": model.model_id,
                    "tier": tier.value,
                    "provider": model.provider,
                }
            except Exception as e:
                logger.warning(f"{model.model_id} failed: {e}")

        # Escalate to the next tier on failure. Passing the escalated tier
        # explicitly avoids re-classifying into the same tier and recursing forever.
        next_tier = _escalate_tier(tier)
        if next_tier:
            return await self.route(query, metadata, tier=next_tier)
        raise RuntimeError(f"All models failed for tier {tier.value}")


def _escalate_tier(tier: ComplexityTier) -> ComplexityTier | None:
    return {
        ComplexityTier.SIMPLE: ComplexityTier.MODERATE,
        ComplexityTier.MODERATE: ComplexityTier.COMPLEX,
    }.get(tier)
```
Step 4: Wire It Up
```python
router = LLMRouter(MODEL_REGISTRY)
router.register_client("openai", openai_client)
router.register_client("anthropic", anthropic_client)

# Simple query → GPT-4o-mini or Haiku
result = await router.route("What is the capital of France?")

# Complex query → GPT-4o or Opus
result = await router.route(
    "Design a distributed caching system for an LLM gateway",
    metadata={"feature": "code_generation"},
)
```
5. Routing Architecture Patterns
Production routing systems share a layered architecture. Each layer has a clear responsibility.
LLM Routing Gateway Architecture
Six layers from application request to cost tracking — each layer is independently testable and replaceable
Layer Responsibilities
Application Layer. Attaches routing metadata to each request: feature, user tier (free vs. paid), and quality preferences.
Gateway. Rate limiting, authentication, and request queuing. If you use Portkey or LiteLLM, this layer is built in.
Complexity Classifier. Takes query text and metadata, outputs a complexity tier. In simple deployments, this is 50 lines of rule-based Python. In sophisticated deployments, a small ML model trained on historical routing outcomes.
Model Registry. Maps complexity tiers to model lists with pricing, latency, and availability. Should be hot-reloadable — update without redeploying when providers change pricing.
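One way to make the registry hot-reloadable is to key reloads off the config file's modification time. A sketch, assuming a JSON file of tier-to-model mappings — the schema and class name here are illustrative, not a standard format:

```python
import json
from pathlib import Path


class HotReloadRegistry:
    """Reload the tier -> models mapping whenever the config file changes on
    disk, so pricing or model updates take effect without a redeploy.

    Assumes a JSON file shaped like {"simple": [...], "moderate": [...], ...}.
    """

    def __init__(self, path: str):
        self.path = Path(path)
        self._mtime = 0.0
        self._registry: dict = {}

    def get(self) -> dict:
        mtime = self.path.stat().st_mtime
        if mtime != self._mtime:  # file changed, or first load
            self._registry = json.loads(self.path.read_text())
            self._mtime = mtime
        return self._registry
```

Callers fetch the mapping through `get()` on every request; the `stat()` call is cheap compared to any model call.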
Provider Adapters. Unified interface over OpenAI, Anthropic, and self-hosted APIs. Each adapter handles auth, formatting, parsing, and retries.
Cost Tracker. Logs model, token counts, latency, and cost per request. Feeds dashboards, budget alerts, and the classifier feedback loop.
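The cost tracker's core computation is per-request cost from token counts and registry pricing. A minimal sketch — the record fields mirror the per-1k pricing convention used in the registry above, and the class name is an assumption:

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """One logged LLM call: what ran, how many tokens, at what price."""
    model_id: str
    input_tokens: int
    output_tokens: int
    cost_per_1k_input: float
    cost_per_1k_output: float

    @property
    def cost(self) -> float:
        # Prices are quoted per 1,000 tokens, so scale token counts down.
        return (self.input_tokens / 1000 * self.cost_per_1k_input
                + self.output_tokens / 1000 * self.cost_per_1k_output)
```

Aggregating these records by tier, feature, and day is what feeds the dashboards and budget alerts.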
6. LLM Routing Code Examples
Three patterns cover most routing needs: rule-based for simplicity, ML-based for accuracy, and cascading for quality guarantees.
Example 1: Rule-Based Router
Pure logic based on request metadata and query characteristics. Zero latency, zero cost overhead.
```python
class RuleBasedRouter:
    """Route based on deterministic rules. Zero classifier overhead."""

    SIMPLE_FEATURES = {"faq", "status_check", "classification", "extraction"}
    COMPLEX_FEATURES = {"code_review", "research_synthesis", "creative_writing"}

    def __init__(self, simple_model: str, moderate_model: str, complex_model: str):
        self.models = {
            ComplexityTier.SIMPLE: simple_model,
            ComplexityTier.MODERATE: moderate_model,
            ComplexityTier.COMPLEX: complex_model,
        }

    def select_model(self, query: str, feature: str) -> str:
        if feature in self.SIMPLE_FEATURES:
            return self.models[ComplexityTier.SIMPLE]
        if feature in self.COMPLEX_FEATURES:
            return self.models[ComplexityTier.COMPLEX]
        word_count = len(query.split())
        if word_count < 30:
            return self.models[ComplexityTier.SIMPLE]
        if word_count > 200:
            return self.models[ComplexityTier.COMPLEX]
        return self.models[ComplexityTier.MODERATE]


router = RuleBasedRouter("gpt-4o-mini", "gpt-4o", "gpt-4o")
model = router.select_model("What are your hours?", feature="faq")
# Returns "gpt-4o-mini"
```
When to use: Early-stage applications and predictable request patterns. Covers 80% of the value with 20% of the complexity.
Example 2: ML-Based Classifier Router
Uses a small LLM to classify complexity before routing. More accurate than rules for ambiguous requests. Adds ~50-100ms latency and one cheap model call ($0.00001 per classification).
```python
import json


class MLClassifierRouter:
    """Use a small LLM to classify query complexity."""

    CLASSIFIER_PROMPT = """Classify this query's complexity for LLM routing.
- SIMPLE: factual lookup, classification, extraction, formatting
- MODERATE: summarization, simple Q&A, template generation
- COMPLEX: multi-step reasoning, code generation, system design

Query: {query}
Return JSON: {{"tier": "SIMPLE"|"MODERATE"|"COMPLEX"}}"""

    def __init__(self, classifier_client, model: str = "gpt-4o-mini"):
        self.client = classifier_client
        self.model = model

    async def classify(self, query: str) -> ComplexityTier:
        response = await self.client.generate(
            model=self.model,
            messages=[{"role": "user",
                       "content": self.CLASSIFIER_PROMPT.format(query=query)}],
            max_tokens=50,
            temperature=0,
        )
        try:
            tier_str = json.loads(response.content).get("tier", "MODERATE")
            return ComplexityTier(tier_str.lower())
        except (json.JSONDecodeError, ValueError):
            # Unparseable classifier output: fall back to the middle tier.
            return ComplexityTier.MODERATE
```
When to use: Applications with diverse query types where rule-based classification misroutes more than 15% of requests.
Example 3: Cascading Router
Send every request to the cheapest model first. If response quality is insufficient, escalate to a more capable model.
```python
class CascadingRouter:
    """Try cheap model first, escalate if confidence is low."""

    def __init__(self, tiers: list[ModelConfig], confidence_threshold: float = 0.7):
        self.tiers = tiers  # Ordered cheapest to most expensive
        self.confidence_threshold = confidence_threshold

    async def route(self, query: str, client_map: dict) -> dict:
        for i, model in enumerate(self.tiers):
            client = client_map.get(model.provider)
            if not client:
                continue

            response = await client.generate(
                model=model.model_id,
                messages=[{"role": "user", "content": query}],
            )

            # Last tier — return unconditionally
            if i == len(self.tiers) - 1:
                return {"content": response.content,
                        "model": model.model_id,
                        "escalated": i > 0}

            # Confidence passes — stop escalating
            confidence = self._evaluate_confidence(response.content)
            if confidence >= self.confidence_threshold:
                return {"content": response.content,
                        "model": model.model_id,
                        "escalated": i > 0}

        raise RuntimeError("All tiers exhausted")

    def _evaluate_confidence(self, content: str) -> float:
        """Heuristic quality check — length, hedging, structure."""
        score = 0.5
        if len(content.split()) > 20:
            score += 0.15
        hedging = ["i'm not sure", "i don't know", "i cannot"]
        if not any(h in content.lower() for h in hedging):
            score += 0.15
        if any(m in content for m in ["```", "1.", "- "]):
            score += 0.1
        if content.rstrip().endswith((".", "```", ")")):
            score += 0.1
        return min(score, 1.0)
```
When to use: Quality-critical applications (medical, legal, financial). Tradeoff: higher latency on escalated requests (two model calls).
7. Static vs Dynamic Routing
The two strategies differ in when the model decision is made — at configuration time or at request time.
Static vs Dynamic LLM Routing
Static routing:
- Model assigned by endpoint, feature flag, or request type
- Zero classifier latency — no computation per request
- Easy to debug — deterministic model selection
- Configuration changes require deployment
- Cannot adapt to query-level complexity differences
- Misses savings on simple queries within complex features
- Best for applications with clearly separated use cases

Dynamic routing:
- Each request evaluated individually for complexity
- Achieves 40-70% cost savings on mixed workloads
- Adapts in real time to query characteristics
- Classifier adds 50-100ms latency per request
- Misclassification risk — wrong model for some queries
- Requires logging and evaluation infrastructure
- Best for diverse workloads with mixed complexity
Hybrid Approach
Most mature systems combine both. Features with predictable complexity get static routing — your classification endpoint always uses GPT-4o-mini, your code review endpoint always uses GPT-4o. General-purpose endpoints use dynamic classification because query complexity varies per request. Start static. Add dynamic routing only where the data shows mixed complexity patterns.
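A hybrid selector can be as small as a static feature map with a dynamic fallback. A sketch, where the feature names and the word-count heuristic standing in for the classifier are illustrative:

```python
STATIC_ROUTES = {  # features with predictable complexity
    "classification": "gpt-4o-mini",
    "code_review": "gpt-4o",
}


def classify_dynamic(query: str) -> str:
    """Stand-in for a per-request classifier; any heuristic or ML model works."""
    return "gpt-4o" if len(query.split()) > 50 else "gpt-4o-mini"


def select_model(query: str, feature: str) -> str:
    """Static route when the feature's complexity is known, dynamic otherwise."""
    if feature in STATIC_ROUTES:
        return STATIC_ROUTES[feature]
    return classify_dynamic(query)
```

The static map handles the predictable endpoints at zero overhead; only general-purpose traffic pays the classifier cost.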
8. LLM Routing Interview Questions
Model routing appears frequently in GenAI system design interviews — it tests production cost management, not just API usage.
Q: Your company’s LLM costs are $30,000/month. The CEO asks you to cut costs by 50% without reducing quality. What is your plan?
A: Start with instrumentation — log model, token counts, and quality scores for every request. Analyze the distribution: in most systems, 50-70% of requests are simple. Implement three-tier routing: simple requests to GPT-4o-mini (17x cheaper), moderate to GPT-4o, complex to the frontier model. If 60% routes to Tier 1, blended cost drops by roughly 50%. Add semantic caching (30-50% hit rate in support contexts) and prompt compression. Together, these achieve 60-80% cost reduction.
Q: How do you ensure the routing classifier does not degrade output quality?
A: Two mechanisms. First, offline evaluation: run a golden dataset through the classifier and measure misclassification rate — if more than 10% of complex queries get routed to Tier 1, the classifier needs tuning. Second, online monitoring: log routing decisions alongside quality scores from your evaluation pipeline. Track quality by tier, set alerts on degradation, and review weekly.
Q: When would you use a cascading router vs. a classifier-based router?
A: Cascading routers suit quality-critical applications (medical, legal, financial) where the cheap model tries first and escalates if confidence is low. Classifier-based routers suit latency-sensitive applications (chatbots, customer support) where a single model call per request is required and double-calling is too slow.
Q: A teammate proposes always using the cheapest model and only escalating when users complain. What is wrong with this approach?
A: Three problems. First, users receive degraded responses until they complain — and most do not complain, they leave. Second, no signal about silent quality degradation. Third, a reactive loop instead of a proactive one. The correct approach: measure quality per tier before shipping, set escalation thresholds based on automated evaluation, and monitor continuously. User complaints are a lagging indicator.
9. LLM Routing in Production
Production routing requires monitoring, A/B testing, and integration with the broader LLMOps stack.
Cost Savings Metrics
- Routing distribution — percentage per tier. Target: 50-70% at Tier 1 for mixed workloads.
- Cost per request — by tier, feature, and user segment vs. the “no routing” baseline.
- Quality by tier — average quality score per tier. Tier 1 must stay above your minimum threshold.
- Misclassification rate — percentage of requests where the routed model produced below-threshold output.
- Routing overhead — rule-based: <1ms. ML-based: 50-100ms. Must stay within your latency budget.
A/B Testing Model Changes
Before changing routing configuration, run an A/B test: control group on existing config, treatment on new config. Measure quality score delta, cost delta, and latency delta across 1,000-5,000 requests per group over at least 7 days. Use consistent hashing on user ID to keep each user on the same config during the test.
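Stable group assignment can be done with a hash of the user ID, so each user stays in one group for the duration of the test regardless of which process handles the request. A sketch (the salt and split are illustrative):

```python
import hashlib


def assign_group(user_id: str, salt: str = "routing-ab-2026",
                 treatment_pct: int = 50) -> str:
    """Deterministically map a user to control or treatment. The same
    user_id always lands in the same group, across processes and restarts."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"
```

Changing the salt per experiment reshuffles users, so carryover effects from a previous test do not bias the next one.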
Production Tools
LiteLLM — Open-source proxy providing a unified API across 100+ LLM providers with routing, fallback, and spend tracking. Self-hosted or cloud.
Portkey — AI gateway with routing, automatic retries, caching, and observability dashboards. Cloud-hosted with a generous free tier.
Martian Model Router — ML-based service that selects the optimal model per request, learning from historical quality data.
Custom routing — Build with the OpenAI and Anthropic SDKs, a classifier function, and a model registry config. Most teams start here and migrate to a managed gateway at 50,000+ requests per day.
Monitoring Checklist
- Cost dashboard by model, tier, feature, and day
- Quality scores per tier with degradation alerts
- Routing distribution chart (sudden shifts = classifier drift)
- Fallback rate tracking (frequent fallbacks = provider issues)
- Latency percentiles (p50, p95, p99) per tier
10. Summary and Key Takeaways
LLM routing is the highest-leverage cost optimization for production GenAI systems. In most applications, 50-70% of queries are simple enough for models that cost 10-30x less.
Start with rule-based routing. Map features to complexity tiers. Route known-simple features to cheap models. This alone cuts costs by 30-50%.
Add dynamic classification for diverse workloads. A classifier that evaluates each request individually captures savings that static routing misses.
Use cascading routers for quality-critical applications. Try the cheap model first, escalate if confidence is low. This guarantees quality parity while maximizing savings.
Measure everything. Log routing decisions, track quality by tier, and review misclassification rates weekly.
Combine routing with caching and prompt compression. Routing selects the cheapest model. Caching eliminates calls on hits. Prompt compression reduces tokens per call. Together, these achieve 70-90% cost reduction.
Related
- LLM Cost Optimization — Complete cost reduction playbook
- LLMOps — CI/CD and deployment patterns for production LLM systems
- LLM Evaluation — Measuring quality per model tier
- Cloud AI Platforms — Provider comparison for multi-provider routing
- LLM Caching — Semantic, KV, and prompt caching strategies
Frequently Asked Questions
What is LLM routing?
LLM routing directs each API request to the most cost-effective model that can handle it at the required quality level. A classifier evaluates query complexity — simple tasks go to cheap models (GPT-4o-mini, Claude Haiku), while complex reasoning goes to capable models (GPT-4o, Claude Opus). This typically reduces API costs by 40-70% without degrading output quality. See our LLM cost optimization guide for the complete cost reduction playbook.
How does LLM routing reduce costs?
LLM routing prevents expensive model calls for tasks that cheaper models handle equally well. In a typical production workload, 50-70% of requests are simple enough for a Tier 1 model that costs 10-30x less per token. By routing only complex requests to expensive models, total API spend drops by 40-70%.
What is the difference between static and dynamic LLM routing?
Static routing assigns models based on fixed rules — endpoint, feature flag, or request type. Dynamic routing uses a classifier to evaluate each request at runtime and select the best model based on complexity. Static is predictable and zero-overhead; dynamic achieves higher savings but adds latency. Most production systems use a hybrid of both.
How do I classify query complexity for model routing?
Start with rule-based signals: query length, presence of code blocks, request type from your application, and whether structured output is required. For more accuracy, use a small LLM (GPT-4o-mini) as the classifier — it evaluates the query and returns a complexity tier in under 100ms at minimal cost. Log routing decisions alongside quality scores to continuously improve accuracy.
What is a cascading router for LLMs?
A cascading router sends every request to the cheapest model first, then evaluates the response confidence. If quality meets the threshold, the cheap response is returned. If confidence is low, the request escalates to the next tier. This guarantees quality parity with the expensive model while achieving cost savings proportional to the cheap model's success rate.
What tools support LLM routing in production?
LiteLLM provides a unified API across 100+ providers with built-in routing. Portkey offers an AI gateway with automatic retries and model fallback. Martian uses ML to select the optimal model per request. For custom routing, most teams build a thin layer using the OpenAI and Anthropic SDKs with a classifier function.
How do I handle fallback when a routed model fails?
Implement a fallback chain: if the selected model returns an error, retry once, then escalate to the next tier. For rate limits, maintain a token bucket per provider and pre-route away from providers approaching their limit. Log every fallback event — frequent fallbacks indicate capacity issues that need architectural attention.
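The per-provider token bucket mentioned above can be sketched in a few lines (rates, capacities, and the class name are illustrative):

```python
import time


class TokenBucket:
    """Per-provider request budget. Pre-route away from a provider whose
    bucket is nearly empty instead of waiting to hit its rate limit."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, n: float = 1.0) -> bool:
        """Spend n tokens if available; False means route elsewhere."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def nearly_empty(self, threshold: float = 0.2) -> bool:
        """True when the provider is approaching its limit."""
        self._refill()
        return self.tokens < self.capacity * threshold
```

The router checks `nearly_empty()` before selecting a provider and falls back to the next one in the chain when it returns True.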
What metrics should I track for an LLM routing system?
Track five metrics: routing distribution (percentage per tier — aim for 50-70% at Tier 1), cost per request by tier, quality scores by tier, misclassification rate (requests routed to cheap models that produced low-quality output), and routing latency overhead (<1ms rule-based, 50-100ms ML-based — keep it within your latency budget). Review weekly to tune classifier thresholds.
Can I use LLM routing with RAG systems?
Yes — use retrieval results as a routing signal. If retrieved chunks have high relevance scores and the query is a direct factual lookup, route to a cheap model. If chunks have low relevance or the query requires synthesis, route to a capable model. This routes 40-60% of RAG queries to Tier 1 models.
How does LLM routing differ from load balancing?
Load balancing distributes requests across identical instances of the same model for throughput. LLM routing selects which model to call based on request characteristics. In production, you need both: routing selects the model, then load balancing distributes across that model's instances.