LLMOps — CI/CD, Eval Gates & LLM Deployment (2026)

This LLMOps guide covers the operational discipline of running LLM applications in production — from prompt versioning and eval-gated CI/CD to model routing, cost monitoring, and canary deployments. Think of it as MLOps rebuilt from scratch for the unique challenges of large language models.

LLMOps gives you version control for prompts, eval-gated CI/CD, model routing, and production monitoring — the operational discipline that prevents silent quality regressions in LLM applications.

MLOps solved model training pipelines. You version datasets, track experiments, and deploy model artifacts through reproducible pipelines. That discipline took years to mature. LLMOps is the equivalent for LLM-powered applications, but the problems are fundamentally different.

In traditional ML, the artifact you deploy is a model binary. In LLM applications, the artifacts are prompts, retrieval configurations, model provider settings, and evaluation thresholds. A one-word change to a system prompt can shift output quality more than a full model retrain would in traditional ML. Yet most teams treat prompts as hardcoded strings buried in application code.

The consequences are predictable. A developer changes a prompt on Friday afternoon. No eval runs. Monday morning, customer support tickets spike 3x because the chatbot started hallucinating product features that do not exist. There is no way to roll back because nobody versioned the old prompt. There is no way to measure the damage because nobody established quality baselines.

LLMOps prevents this. It gives you version control for prompts, automated evaluation gates in CI, model routing that lets you swap providers without code changes, and production monitoring that catches quality regressions before users report them.

You will learn:

  • How LLMOps differs from MLOps and why those differences matter
  • How to build a prompt registry with semantic versioning
  • How to gate deployments with automated eval suites
  • How to implement model routing and A/B testing for prompts
  • How to monitor cost, quality, and latency in production
  • What interviewers expect when they ask about LLM deployment operations

Development | Impact
LangSmith Eval Gates | Native CI/CD integration — block merges when eval scores drop below threshold
Langfuse 3.0 | Open-source tracing + prompt management with built-in A/B testing and cost attribution
Humanloop Prompt Pipelines | Visual prompt versioning with automatic regression testing on every change
OpenAI Evals v2 | Standardized eval framework adopted by multiple providers for cross-model benchmarking
Model Router Pattern | Widely adopted: route requests to different models based on complexity, cost, and latency requirements
FinOps for AI | Dedicated cost management tools for LLM inference — per-feature, per-user, per-request attribution

LLMOps shares MLOps DNA but faces fundamentally different problems: prompts are code, evals replace unit tests, provider switching is routine, and costs are unpredictable.

Traditional MLOps and LLMOps share some DNA — both care about versioning, reproducibility, and monitoring. But the differences are significant enough that MLOps tools alone cannot solve LLMOps problems.

Prompts are code. In MLOps, the artifact is a trained model file. In LLMOps, the primary artifact is a prompt template — a text string that controls behavior. Prompts need version control, diffing, approval workflows, and rollback capabilities. A model binary is deterministic given the same input. A prompt produces non-deterministic output even with temperature set to 0 (provider-side batching and quantization cause variance).

Evaluation replaces unit tests. You cannot assert that an LLM output equals an expected string. Instead, you run evaluation suites that score outputs on dimensions like correctness, faithfulness, relevance, and safety. These scores determine whether a deployment proceeds.

Provider switching is routine. Your LLM provider will have outages, change pricing, deprecate models, or release better alternatives. A well-designed LLMOps system abstracts the provider so you can switch from GPT-4o to Claude Sonnet to Gemini Pro without touching application code.

Cost is unpredictable. A traditional ML model runs on fixed infrastructure. LLM inference costs scale with token volume, and a single prompt change that increases output verbosity can double your monthly bill overnight.

Non-determinism is permanent. Even identical inputs produce slightly different outputs across runs. Your testing, monitoring, and alerting must account for this fundamental property.
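A quick way to see why exact-match assertions fail under non-determinism: compare outputs against a similarity tolerance instead of for equality. This sketch uses the stdlib `difflib.SequenceMatcher` as a rough lexical proxy so it stays dependency-free; production eval suites typically use embedding similarity or an LLM-as-judge scorer instead.

```python
# Non-determinism breaks exact-match assertions; compare within a tolerance
# band instead. SequenceMatcher is a rough lexical proxy used here so the
# sketch stays stdlib-only; real evals use embeddings or LLM-as-judge.
from difflib import SequenceMatcher

def outputs_equivalent(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two outputs as equivalent when similarity clears the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

run_1 = "The refund window is 30 days from delivery."
run_2 = "Refunds are available within 30 days of delivery."
print(run_1 == run_2)                                   # exact match: False
print(outputs_equivalent(run_1, run_2, threshold=0.5))  # tolerance check: True
```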

Teams without LLMOps discover problems through user complaints. A prompt change silently degrades quality. A model deprecation notice arrives with 30 days warning and no migration path. A feature launch drives 10x token usage and the monthly bill jumps from $5,000 to $50,000. These are not hypotheticals — they are patterns repeated across every company deploying LLM applications at scale.


LLMOps CI/CD extends traditional pipelines with prompt versioning, automated eval suites, quality gates, canary releases, and continuous production monitoring.

LLMOps CI/CD Pipeline — From Commit to Production

Traditional CI/CD tests code. LLMOps CI/CD also tests prompts, evaluates model quality, and gates on regression scores.

  1. Commit: prompt template, config, or code change pushed to repo (prompt registry, version control, change detection)
  2. Eval Suite: run automated evaluations against golden datasets (unit evals, regression tests, A/B scoring)
  3. Quality Gate: block deployment if quality score drops below threshold (score comparison, regression check, human review flag)
  4. Staging: shadow deployment with production traffic sampling (canary release, traffic mirroring, cost monitoring)
  5. Production: full rollout with continuous monitoring and alerting (model router, A/B testing, rollback trigger)
  6. Monitor: track quality, cost, latency, and user feedback in real time (quality scores, cost dashboard, alert system)

Prompts are not configuration. They are the core logic of your application. Treat them accordingly.

A prompt registry stores prompt templates as versioned artifacts with metadata: author, creation date, eval scores, and deployment history. Every change creates a new version. No overwrites. No in-place edits.

# prompt_registry.py — Simple file-based prompt registry
import json
import hashlib
from pathlib import Path
from datetime import datetime


class PromptRegistry:
    def __init__(self, registry_dir: str = "./prompts"):
        self.registry_dir = Path(registry_dir)
        self.registry_dir.mkdir(exist_ok=True)

    def register(self, name: str, template: str, metadata: dict = None) -> str:
        """Register a new prompt version. Returns version hash."""
        version_hash = hashlib.sha256(template.encode()).hexdigest()[:12]
        entry = {
            "name": name,
            "version": version_hash,
            "template": template,
            "metadata": metadata or {},
            "created_at": datetime.utcnow().isoformat(),
            "eval_scores": {},  # populated by eval pipeline
        }
        path = self.registry_dir / f"{name}_{version_hash}.json"
        path.write_text(json.dumps(entry, indent=2))
        return version_hash

    def get(self, name: str, version: str = "latest") -> dict:
        """Retrieve a prompt by name and version."""
        if version == "latest":
            versions = sorted(
                self.registry_dir.glob(f"{name}_*.json"),
                key=lambda p: json.loads(p.read_text())["created_at"],
                reverse=True,
            )
            if not versions:
                raise ValueError(f"No versions found for prompt: {name}")
            return json.loads(versions[0].read_text())
        path = self.registry_dir / f"{name}_{version}.json"
        return json.loads(path.read_text())

    def list_versions(self, name: str) -> list[dict]:
        """List all versions of a prompt with their eval scores."""
        versions = []
        for path in self.registry_dir.glob(f"{name}_*.json"):
            entry = json.loads(path.read_text())
            versions.append({
                "version": entry["version"],
                "created_at": entry["created_at"],
                "eval_scores": entry["eval_scores"],
            })
        return sorted(versions, key=lambda v: v["created_at"], reverse=True)


# Usage
registry = PromptRegistry()
v1 = registry.register(
    name="customer_support",
    template="You are a helpful support agent for {company}. Answer the user's question based only on the provided context.\n\nContext: {context}\nQuestion: {question}",
    metadata={"author": "team-ml", "task": "rag-qa"},
)
print(f"Registered version: {v1}")

The key principle: every prompt change flows through the same pipeline as code changes. Pull request, review, eval run, merge. No one edits prompts directly in production.

An eval gate blocks deployment when quality scores drop below a defined threshold. This is the LLMOps equivalent of a failing test suite.

# eval_gate.py — Run evals and gate deployment
from dataclasses import dataclass


@dataclass
class EvalResult:
    metric: str
    score: float
    threshold: float
    passed: bool


def _contains_hallucination(response: str, context: str) -> bool:
    """Placeholder faithfulness check: flag responses with little lexical
    overlap with the context. Replace with an NLI-based or LLM-as-judge
    scorer in production."""
    response_words = set(response.lower().split())
    context_words = set(context.lower().split())
    overlap = len(response_words & context_words) / max(len(response_words), 1)
    return overlap < 0.3


def _relevance_score(response: str, question: str) -> float:
    """Placeholder relevance scorer: lexical overlap between question and response."""
    question_words = set(question.lower().split())
    response_words = set(response.lower().split())
    return len(question_words & response_words) / max(len(question_words), 1)


def run_eval_suite(
    prompt_template: str,
    golden_dataset: list[dict],
    llm_fn,  # callable that takes a prompt and returns a response
    thresholds: dict[str, float],
) -> list[EvalResult]:
    """Run evaluation suite against golden dataset. Returns pass/fail per metric."""
    scores = {"correctness": [], "faithfulness": [], "relevance": []}
    for example in golden_dataset:
        prompt = prompt_template.format(**example["inputs"])
        response = llm_fn(prompt)
        # Score each dimension (simplified — use RAGAS or custom scorers in production)
        scores["correctness"].append(
            1.0 if example["expected_answer"].lower() in response.lower() else 0.0
        )
        scores["faithfulness"].append(
            1.0 if not _contains_hallucination(response, example["context"]) else 0.0
        )
        scores["relevance"].append(
            _relevance_score(response, example["inputs"]["question"])
        )
    results = []
    for metric, values in scores.items():
        avg_score = sum(values) / len(values) if values else 0.0
        threshold = thresholds.get(metric, 0.8)
        results.append(EvalResult(
            metric=metric,
            score=round(avg_score, 3),
            threshold=threshold,
            passed=avg_score >= threshold,
        ))
    return results


def gate_deployment(results: list[EvalResult]) -> bool:
    """Returns True if all eval metrics pass their thresholds."""
    all_passed = all(r.passed for r in results)
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        print(f"  [{status}] {r.metric}: {r.score:.3f} (threshold: {r.threshold})")
    if not all_passed:
        print("\nDeployment BLOCKED — eval gate failed.")
    else:
        print("\nDeployment APPROVED — all gates passed.")
    return all_passed

Run this in your CI pipeline on every PR that touches prompt files. No exceptions. A 2% drop in correctness that slips through review will cost you far more to detect and fix in production.
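In CI, the gate's boolean becomes the process exit code, which is what actually blocks the merge. A minimal sketch, with `EvalResult` and `gate_deployment` re-declared inline so it runs standalone (in a real pipeline you would import them from `eval_gate.py`); the scores are illustrative:

```python
# ci_gate.py — hypothetical CI entry point wrapping the eval gate.
# EvalResult/gate_deployment are re-declared so this sketch runs standalone.
from dataclasses import dataclass

@dataclass
class EvalResult:
    metric: str
    score: float
    threshold: float
    passed: bool

def gate_deployment(results: list[EvalResult]) -> bool:
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        print(f"[{status}] {r.metric}: {r.score:.3f} (threshold: {r.threshold})")
    return all(r.passed for r in results)

# In CI these scores come from run_eval_suite against the golden dataset.
results = [
    EvalResult("correctness", 0.91, 0.85, True),
    EvalResult("faithfulness", 0.88, 0.90, False),  # regression below threshold
    EvalResult("relevance", 0.84, 0.80, True),
]
approved = gate_deployment(results)
print(f"approved={approved}")  # approved=False
# The CI script then runs: sys.exit(0 if approved else 1); nonzero blocks the merge
```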


The model router, A/B testing harness, and cost monitor are the three core components every production LLMOps system needs beyond a basic prompt registry.

A model router abstracts LLM providers behind a unified interface. You configure routing rules — not code changes — to direct traffic to different models.

# model_router.py — Route requests to different LLM providers
import random
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    provider: str                   # "openai", "anthropic", "google"
    model: str                      # "gpt-4o", "claude-sonnet-4", "gemini-2.0-pro"
    weight: float = 1.0             # traffic weight for A/B testing
    max_cost_per_req: float = 0.10  # cost ceiling per request
    timeout_ms: int = 30000


@dataclass
class RouterConfig:
    models: list[ModelConfig] = field(default_factory=list)
    fallback_model: str = "gpt-4o-mini"


class ModelRouter:
    def __init__(self, config: RouterConfig):
        self.config = config
        self._clients = {}  # initialized lazily per provider

    def route(self, request: dict) -> ModelConfig:
        """Select a model based on routing rules and traffic weights."""
        eligible = [m for m in self.config.models if m.weight > 0]
        if not eligible:
            raise ValueError("No eligible models configured")
        # Weighted random selection for A/B testing
        total_weight = sum(m.weight for m in eligible)
        rand = random.uniform(0, total_weight)
        cumulative = 0.0
        for model in eligible:
            cumulative += model.weight
            if rand <= cumulative:
                return model
        return eligible[-1]  # fallback

    def call(self, prompt: str, request_metadata: dict = None) -> dict:
        """Route and call the selected model."""
        model = self.route(request_metadata or {})
        # In production: call the actual provider API here
        return {
            "model": model.model,
            "provider": model.provider,
            "response": f"[Response from {model.model}]",
        }


# Configuration-driven routing — no code changes to switch models
router = ModelRouter(RouterConfig(
    models=[
        ModelConfig(provider="anthropic", model="claude-sonnet-4", weight=0.7),
        ModelConfig(provider="openai", model="gpt-4o", weight=0.2),
        ModelConfig(provider="google", model="gemini-2.0-pro", weight=0.1),
    ],
    fallback_model="gpt-4o-mini",
))

This pattern pays for itself the first time a provider has an outage. You update the weights in your config file, redeploy, and traffic shifts in seconds. No emergency code changes.
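A quick sanity check on the weighted selection: simulate many routes and confirm the observed traffic split approximates the configured 70/20/10 weights. The `pick` helper below restates the router's selection loop in standalone form; the seed and sample size are arbitrary.

```python
# Simulate 10,000 routes and verify traffic approximates the configured weights.
import random
from collections import Counter

WEIGHTS = {"claude-sonnet-4": 0.7, "gpt-4o": 0.2, "gemini-2.0-pro": 0.1}

def pick(weights: dict[str, float]) -> str:
    """Weighted random selection, same loop as ModelRouter.route."""
    rand = random.uniform(0, sum(weights.values()))
    cumulative = 0.0
    for model, w in weights.items():
        cumulative += w
        if rand <= cumulative:
            return model
    return list(weights)[-1]  # guard against float rounding

random.seed(7)
counts = Counter(pick(WEIGHTS) for _ in range(10_000))
for model, n in counts.most_common():
    print(f"{model}: {n / 10_000:.1%}")
```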

Prompt A/B testing splits traffic between prompt versions and measures quality metrics to determine which version performs better. The model router above already supports traffic weighting. Add eval scoring to complete the loop.

# ab_test.py — Track and compare prompt versions in production
from collections import defaultdict


class PromptABTest:
    def __init__(self, test_name: str, variants: dict[str, str]):
        """
        variants: {"control": "prompt_v1_hash", "treatment": "prompt_v2_hash"}
        """
        self.test_name = test_name
        self.variants = variants
        self.metrics = defaultdict(lambda: {"scores": [], "latencies": [], "costs": []})

    def record(self, variant: str, score: float, latency_ms: float, cost: float):
        """Record a single observation for a variant."""
        self.metrics[variant]["scores"].append(score)
        self.metrics[variant]["latencies"].append(latency_ms)
        self.metrics[variant]["costs"].append(cost)

    def summary(self) -> dict:
        """Return summary statistics for each variant."""
        result = {}
        for variant, data in self.metrics.items():
            n = len(data["scores"])
            if n == 0:
                continue
            result[variant] = {
                "n": n,
                "avg_score": round(sum(data["scores"]) / n, 3),
                "avg_latency_ms": round(sum(data["latencies"]) / n, 1),
                "avg_cost": round(sum(data["costs"]) / n, 4),
                "p95_latency_ms": round(sorted(data["latencies"])[int(n * 0.95)], 1) if n >= 20 else None,
            }
        return result

Run A/B tests for at least 1,000 observations per variant before drawing conclusions. Smaller samples produce unreliable results because LLM output quality has high variance.
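The 1,000-observation guidance follows from basic sampling math: the uncertainty on a variant's mean score shrinks with the square root of n. A sketch, assuming an illustrative per-request score spread of 0.30:

```python
# Why small samples mislead: the ~95% confidence half-width on a variant's
# mean score shrinks with sqrt(n). The 0.30 spread is an assumption.
import math

def score_resolution(std_dev: float, n: int) -> float:
    """Approximate 95% confidence half-width on the mean after n observations."""
    return 1.96 * std_dev / math.sqrt(n)

for n in (50, 200, 1_000):
    print(f"n={n:>5}: mean score known to within ±{score_resolution(0.30, n):.3f}")
# At n=1,000 a ~2-point score gap is resolvable; at n=50 it is buried in noise.
```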

LLM costs are the biggest operational surprise for teams without monitoring. A single prompt change that adds “think step by step” to a system prompt can increase output tokens by 3x.

# cost_monitor.py — Track and alert on LLM costs
from dataclasses import dataclass
from datetime import datetime, timedelta
from collections import defaultdict

# Approximate costs per 1M tokens (input/output) as of early 2026
COST_PER_MILLION = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "claude-haiku-3.5": {"input": 0.80, "output": 4.00},
    "gemini-2.0-pro": {"input": 1.25, "output": 5.00},
}


@dataclass
class RequestLog:
    model: str
    input_tokens: int
    output_tokens: int
    feature: str  # which product feature made this call
    timestamp: datetime


class CostMonitor:
    def __init__(self, daily_budget: float = 500.0, alert_threshold: float = 0.8):
        self.daily_budget = daily_budget
        self.alert_threshold = alert_threshold
        self.logs: list[RequestLog] = []

    def log_request(self, log: RequestLog):
        self.logs.append(log)

    def cost_for_request(self, log: RequestLog) -> float:
        rates = COST_PER_MILLION.get(log.model, {"input": 5.0, "output": 15.0})
        input_cost = (log.input_tokens / 1_000_000) * rates["input"]
        output_cost = (log.output_tokens / 1_000_000) * rates["output"]
        return input_cost + output_cost

    def daily_spend(self, date: datetime | None = None) -> dict:
        date = date or datetime.utcnow()
        start = date.replace(hour=0, minute=0, second=0, microsecond=0)
        end = start + timedelta(days=1)
        day_logs = [log for log in self.logs if start <= log.timestamp < end]
        by_feature = defaultdict(float)
        by_model = defaultdict(float)
        total = 0.0
        for log in day_logs:
            cost = self.cost_for_request(log)
            total += cost
            by_feature[log.feature] += cost
            by_model[log.model] += cost
        return {
            "total": round(total, 2),
            "budget_used_pct": round(total / self.daily_budget * 100, 1),
            "alert": total >= self.daily_budget * self.alert_threshold,
            "by_feature": dict(by_feature),
            "by_model": dict(by_model),
        }

Set alerts at 80% of your daily budget. By the time you hit 100%, you have already lost the money. Attribution by feature is critical — you need to know which feature is driving the spend, not just the total.


Three patterns apply to LLM prompt and model deployments: canary releases for gradual traffic shifting, blue-green with shadow traffic for safe cutover, and feature-flag rollouts for user-segment control.

A canary release sends a small percentage of traffic (typically 5-10%) to the new prompt or model version. If quality metrics hold, you gradually increase the traffic share. If metrics drop, you roll back instantly.

The pattern works like this:

  1. Deploy new prompt version alongside the current version
  2. Route 5% of traffic to the new version
  3. Monitor quality scores, latency, and cost for 1 hour minimum
  4. If metrics are stable, increase to 25%, then 50%, then 100%
  5. If any metric degrades beyond threshold, route 100% back to the old version

Canary releases catch problems that eval suites miss. Your golden dataset covers known scenarios. Production traffic includes edge cases, adversarial inputs, and usage patterns that no test suite anticipates.
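The ramp-and-rollback loop above can be sketched as a small state function; the stages mirror the 5/25/50/100 steps, and everything else is illustrative:

```python
# Canary ramp controller: advance the traffic share only while metrics hold,
# snap back to zero on any regression.
RAMP_STAGES = [0.05, 0.25, 0.50, 1.00]

def next_canary_share(current: float, metrics_ok: bool) -> float:
    """Return the canary version's new traffic share."""
    if not metrics_ok:
        return 0.0  # instant rollback: 100% of traffic back to the old version
    for stage in RAMP_STAGES:
        if stage > current:
            return stage
    return current  # already fully rolled out

share = 0.0
for healthy in (True, True, False):  # regression detected at the third check
    share = next_canary_share(share, healthy)
    print(f"canary traffic: {share:.0%}")
# prints 5%, then 25%, then 0% after the rollback
```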

In a blue-green deployment, you run two identical environments. “Blue” serves production traffic. “Green” receives the update. You switch DNS or load balancer routing to cut over.

For LLM applications, add shadow traffic: send a copy of every production request to the green environment without returning its response to users. Compare green’s outputs against blue’s outputs. If green scores equal or higher on your eval metrics, proceed with the cutover.

Shadow traffic costs double the inference spend during the testing window. Budget for it. The alternative — deploying untested changes directly to production — costs more when things go wrong.
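The shadow pattern can be sketched as a request handler that mirrors each call: blue answers the user, green is scored but never returned. `call_blue`, `call_green`, and `eval_score` are hypothetical stand-ins.

```python
# Blue-green shadow sketch: every request is answered by blue (returned to the
# user) and mirrored to green (scored offline, never returned).
def call_blue(q: str) -> str:
    return f"[blue answer to: {q}]"    # current production version

def call_green(q: str) -> str:
    return f"[green answer to: {q}]"   # candidate version

def eval_score(answer: str) -> float:
    return 0.90 if "green" in answer else 0.85  # stand-in eval scores

def handle_request(q: str, shadow_log: list) -> str:
    blue_answer = call_blue(q)    # the user sees only this
    green_answer = call_green(q)  # mirrored copy, scored offline
    shadow_log.append((eval_score(blue_answer), eval_score(green_answer)))
    return blue_answer

shadow_log: list[tuple[float, float]] = []
for q in ["refund policy?", "shipping time?", "cancel order?"]:
    handle_request(q, shadow_log)

parity = sum(g >= b for b, g in shadow_log) / len(shadow_log)
print(f"green matched or beat blue on {parity:.0%} of shadowed requests")
# Cut over only if green holds parity across the full testing window
```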

Use feature flags to control which users receive the new prompt version. Start with internal users, expand to beta users, then roll out to all users. This approach works well when you want human feedback before full deployment.

feature-flags.yaml

prompt_v2_rollout:
  enabled: true
  rules:
    - segment: internal_users
      percentage: 100
    - segment: beta_users
      percentage: 50
    - segment: all_users
      percentage: 0  # increase after internal validation
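A minimal evaluator for a flag file like the one above: hash the user id into a stable 0-99 bucket and compare against the segment's percentage. The `ROLLOUT` mapping mirrors the YAML; the hashing scheme is an assumption for illustration, not any specific feature-flag product's behavior.

```python
# Hypothetical percentage-rollout evaluator: a stable hash bucket per user
# keeps each user in the same cohort across requests.
import hashlib

ROLLOUT = {"internal_users": 100, "beta_users": 50, "all_users": 0}  # mirrors the YAML

def bucket(user_id: str) -> int:
    """Stable 0-99 bucket derived from the user id."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def sees_new_prompt(user_id: str, segment: str) -> bool:
    return bucket(user_id) < ROLLOUT.get(segment, 0)

print(sees_new_prompt("alice@example.com", "internal_users"))  # True: 100% rollout
print(sees_new_prompt("alice@example.com", "all_users"))       # False: 0% rollout
```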

LangSmith, Langfuse, Humanloop, Braintrust, PromptLayer, and Weights & Biases each serve different team sizes and priorities — the right choice depends on your stack and data ownership requirements.

Tool | Focus | Pricing | Best For
LangSmith | Tracing, evals, prompt hub | Free tier + paid | Teams already using LangChain
Langfuse | Open-source tracing + prompt management | Free (self-hosted) or cloud | Teams that want full control of their data
Weights & Biases | Experiment tracking, eval dashboards | Free tier + paid | Teams with existing W&B for ML experiments
Humanloop | Prompt versioning, eval pipelines, A/B testing | Paid | Product teams shipping prompt-driven features
PromptLayer | Prompt registry, version history, analytics | Free tier + paid | Quick prompt versioning without infrastructure
Braintrust | Eval-first platform, dataset management | Free tier + paid | Teams that prioritize evaluation rigor

If you already use LangChain, start with LangSmith. The integration is native and the free tier covers small projects.

If you need open-source and self-hosted, Langfuse is the clear choice. You own your data, you control the infrastructure, and the tracing quality matches commercial tools.

If you need enterprise-grade prompt management with approval workflows, Humanloop or Braintrust fit better. They are designed for teams where prompt changes go through formal review processes.

For monitoring and cost tracking specifically, most teams combine a tracing tool (LangSmith or Langfuse) with a custom cost dashboard. No single tool covers every LLMOps concern perfectly today.

For a deeper comparison of LangSmith and Langfuse, see LangSmith vs Langfuse.


LLMOps interview questions test whether you understand concrete mechanisms — prompt versioning, eval gates, rollback paths, and provider failover — not vague references to “monitoring.”

LLMOps questions test whether you understand the operational complexity of production LLM applications. Interviewers want to hear about concrete mechanisms — not vague references to “monitoring” and “testing.”

Q: “How do you deploy a prompt change safely?”

Weak: “We test the prompt manually, then deploy it.”

Strong: “Every prompt change is a versioned artifact in our prompt registry. When a developer submits a new version, CI automatically runs our eval suite against a golden dataset of 500+ examples. The suite scores correctness, faithfulness, and relevance. If any metric drops more than 2% below the current production baseline, the merge is blocked. If evals pass, we deploy via canary — 5% of traffic hits the new prompt for 2 hours while we monitor quality scores, latency p95, and cost per request. Only after metrics stabilize do we ramp to 100%. The entire rollback path is a config change that takes <30 seconds.”

Q: “How do you handle a provider outage?”

Weak: “We would switch to another provider.”

Strong: “Our model router abstracts the LLM provider behind a unified interface. Each provider has a health check endpoint polled every 30 seconds. When a provider fails three consecutive checks, the router automatically shifts its traffic weight to zero and redistributes to healthy providers based on their configured weights. We maintain at least two providers in every routing config, and we run our eval suite against each provider monthly to verify output quality parity. The failover is automatic — no engineer needs to wake up at 3 AM.”

Other questions to prepare for:

  • Walk through your CI/CD pipeline for prompt changes. What blocks a merge?
  • How would you A/B test two different prompt strategies in production?
  • Your LLM costs doubled this month. How do you investigate and fix it?
  • Design a model routing system that handles provider outages automatically
  • How do you version prompts? Why not just use Git?
  • What metrics do you monitor for an LLM in production? What triggers an alert?

Production LLMOps requires visibility across five dimensions — quality, latency percentiles, cost per request, error rates, and user feedback — plus a rollback strategy that answers how fast you detect, revert, and prevent recurrence.

Every LLMOps team needs visibility into five dimensions. Build a dashboard that surfaces these in real-time:

Quality scores. Track average eval scores (correctness, faithfulness, relevance) over time. Plot a 7-day moving average. Alert when the moving average drops more than 5% from baseline.

Latency percentiles. Track p50, p95, and p99 latency per model and per feature. LLM latency is highly variable — p50 might be 800ms while p99 is 4,500ms. Alert on p95, not average.

Cost per request. Track the mean and p95 cost per request, broken down by feature, model, and user segment. A cost spike usually means either a prompt change increased token usage or a traffic surge hit an expensive model.

Error rates. Track API errors (timeouts, rate limits, 500s), guardrail blocks, and parse failures (when structured output fails to parse). Each category needs its own alert threshold.

User feedback signals. Track thumbs up/down ratios, regeneration requests, and conversation abandonment rates. These are the ground truth that your automated metrics approximate.
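The quality-score alert described above reduces to a moving-average comparison against a baseline. A sketch with illustrative daily scores:

```python
# Quality-drift alert: compare the 7-day moving average of eval scores to a
# baseline and fire when it drops more than 5% relative. Scores are illustrative.
def moving_average(scores: list[float], window: int = 7) -> float:
    recent = scores[-window:]
    return sum(recent) / len(recent)

def quality_alert(scores: list[float], baseline: float, max_drop: float = 0.05) -> bool:
    return moving_average(scores) < baseline * (1 - max_drop)

baseline = 0.90
daily_scores = [0.90, 0.89, 0.85, 0.84, 0.83, 0.82, 0.80]  # sliding after a deploy
print(quality_alert(daily_scores, baseline))  # True: moving average ~0.847 < 0.855
```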

Your rollback plan should answer three questions:

  1. How fast can you detect a problem? Automated quality monitoring should catch regressions within 15 minutes of deployment.
  2. How fast can you revert? Routing config changes should propagate in <60 seconds. Prompt version rollbacks should be a single command.
  3. How do you prevent the same problem from recurring? Every rollback triggers a post-mortem that adds new examples to your golden dataset covering the failure case.

Store all LLMOps configuration in version-controlled files:

llmops-config.yaml

prompt_registry:
  storage: s3://prompts-prod/registry
  approval_required: true
  auto_eval: true

eval_gates:
  golden_dataset: s3://evals/golden-v3.jsonl
  thresholds:
    correctness: 0.85
    faithfulness: 0.90
    relevance: 0.80
  min_examples: 500
  block_on_regression: true

routing:
  primary:
    provider: anthropic
    model: claude-sonnet-4
    weight: 0.8
  secondary:
    provider: openai
    model: gpt-4o
    weight: 0.2
  fallback:
    provider: openai
    model: gpt-4o-mini
  health_check_interval_seconds: 30
  failover_after_consecutive_failures: 3

monitoring:
  quality_alert_threshold: -0.05  # 5% drop from baseline
  latency_p95_alert_ms: 5000
  daily_cost_budget: 500.0
  cost_alert_threshold: 0.80
This file is the single source of truth for your LLMOps configuration. Every change goes through code review. Every deployment reads from this file. No one edits production settings through a UI or CLI without it being reflected here.


Every LLM application that serves users needs prompt versioning, eval gates, canary releases, and production monitoring — and the time to build this infrastructure is before your first regression, not after.

Question | Answer
Do I need LLMOps? | Yes — if you deploy LLM features to users, you need prompt versioning, eval gates, and monitoring
When should I start? | Before your first production deployment. Retrofitting is 10x harder
What tool should I pick? | LangSmith if you use LangChain. Langfuse if you want open-source. Humanloop for enterprise prompt workflows
How do I prevent bad deployments? | Eval gates in CI + canary releases + automated quality monitoring
What about cost? | Track per-request, per-feature, per-model. Alert at 80% of daily budget
Git for prompts? | Git is necessary but not sufficient. Add a prompt registry for metadata, eval scores, and deployment history
  • LangSmith — Tracing, evaluation, and prompt management by LangChain
  • Langfuse — Open-source LLM observability and prompt management
  • Humanloop — Prompt versioning, evaluation pipelines, and A/B testing
  • Braintrust — Eval-first LLMOps platform with dataset management
  • PromptLayer — Prompt registry with version history and analytics
  • Weights & Biases Prompts — Experiment tracking for LLM applications

Last updated: March 2026. LLMOps tooling is evolving rapidly; verify current capabilities against official documentation.

Frequently Asked Questions

What is LLMOps and how is it different from MLOps?

LLMOps is the operational discipline for running LLM applications in production. Unlike MLOps where you deploy model binaries, LLMOps manages prompts, retrieval configurations, model provider settings, and evaluation thresholds. A one-word change to a system prompt can shift output quality more than a full model retrain in traditional ML. LLMOps provides version control for prompts, automated evaluation gates in CI, model routing, and production monitoring.

How do you version and deploy prompts in production?

Build a prompt registry with semantic versioning where each prompt has a version number, creation timestamp, and associated evaluation scores. Gate deployments with automated eval suites that run against a test dataset before any prompt change reaches production. Use canary deployments to roll out prompt changes gradually. This prevents the common failure: a developer changes a prompt without running evals, and quality regressions go undetected until users report them.

What is model routing in LLMOps?

Model routing lets you swap LLM providers without code changes. Route requests to different models based on task complexity, cost constraints, or latency requirements. Simple queries go to a cheaper, faster model (like GPT-4o-mini), while complex reasoning tasks route to a more capable model (like Claude Opus 4). This also enables A/B testing of prompts and models in production with traffic splitting.

How do you monitor LLM applications in production?

Monitor at minimum three dimensions: cost (token usage per request, daily spend, cost per feature), quality (automated eval scores, user feedback, hallucination rate), and latency (time to first token, total response time, retrieval latency); mature setups also track error rates and user feedback signals. Set alerts for quality drift — when eval scores drop below baseline. Track these metrics per prompt version and model to identify regressions quickly.

What is an eval gate in LLMOps CI/CD?

An eval gate blocks deployment when quality scores drop below a defined threshold. It is the LLMOps equivalent of a failing test suite. The eval suite scores LLM outputs on dimensions like correctness, faithfulness, and relevance against a golden dataset. If any metric drops more than 2% below the production baseline, the merge is blocked automatically.

What is a canary release for LLM deployments?

A canary release sends a small percentage of traffic (typically 5-10%) to a new prompt or model version. If quality metrics hold, traffic is gradually increased to 25%, 50%, then 100%. If any metric degrades beyond the threshold, 100% of traffic routes back to the old version instantly. Canary releases catch problems that eval suites miss because production traffic includes edge cases no test suite anticipates.

How do you handle an LLM provider outage?

A model router abstracts the LLM provider behind a unified interface with health check endpoints polled every 30 seconds. When a provider fails three consecutive checks, the router automatically shifts its traffic weight to zero and redistributes to healthy providers. Maintaining at least two providers in every routing config ensures automatic failover without requiring manual intervention.

How do you A/B test prompts in production?

Prompt A/B testing splits traffic between prompt versions using the model router's traffic weighting and measures quality metrics to determine which version performs better. Run A/B tests for at least 1,000 observations per variant before drawing conclusions, because LLM output quality has high variance and smaller samples produce unreliable results.

How do you prevent LLM cost surprises?

Track per-request cost broken down by feature, model, and user segment. Set alerts at 80% of your daily budget so you can act before hitting 100%. A single prompt change that adds instructions like 'think step by step' can increase output tokens by 3x and double monthly bills overnight. Attribution by feature is critical to identify which part of the application is driving spend.

Which LLMOps tools should I use — LangSmith, Langfuse, or Humanloop?

If you already use LangChain, start with LangSmith for native integration. If you need open-source and self-hosted, Langfuse is the clear choice — you own your data and control the infrastructure. For enterprise-grade prompt management with approval workflows, Humanloop or Braintrust fit better. Most teams combine a tracing tool with a custom cost dashboard since no single tool covers every LLMOps concern.