LLM Observability — Trace, Debug & Monitor LLM Apps (2026)
Your LLM app works in development. Users report “it sometimes gives wrong answers” in production. You check the logs and find… a prompt string and a response string. No trace of what happened in between. That gap between “it works on my machine” and “it fails for 12% of users” is exactly what LLM observability solves. You need traces, metrics, and evaluations — not just logs.
Who this is for:
- Junior engineers: You want to understand why your LLM app gives different answers to the same question and how to debug it
- Senior engineers: You need a production-grade monitoring strategy for non-deterministic AI systems — cost tracking, quality drift detection, and alerting
What Changed in 2026
LLM observability matured fast. Here’s what’s different from 2024-2025:
| Shift | Before (2024) | Now (2026) |
|---|---|---|
| OpenTelemetry for LLMs | Experimental semantic conventions | Stable gen_ai.* attributes in OpenTelemetry 1.30+ |
| Cost tracking | Manual token counting | Automatic per-trace cost with model pricing tables baked in |
| Evaluation integration | Separate pipelines | Inline evaluation scores attached directly to traces |
| Multi-framework support | Each tool supported 1-2 frameworks | Langfuse supports 15+ frameworks; LangSmith expanded beyond LangChain |
| Self-hosting | Langfuse required manual Docker setup | Langfuse ships one-click deploy to Railway, Render, and Kubernetes Helm charts |
The biggest shift: observability is no longer optional for production LLM apps. Teams that skip it spend 3-5x longer debugging issues because non-deterministic systems produce failures that traditional logging cannot explain.
How LLM Observability Works
Traditional software observability tracks three pillars: logs, metrics, and traces. LLM observability adapts these for non-deterministic systems where the same input can produce different outputs.
The Three Pillars for LLMs
1. Traces — The full lifecycle of a request through your LLM pipeline. A trace for a RAG pipeline captures: the user query, the embedding call, the vector search results, the reranking step, and the final LLM generation. Each step is a span with inputs, outputs, latency, and token counts.
2. Metrics — Aggregated numbers that tell you if things are getting better or worse:
- Token usage per request (input + output)
- Latency at p50, p95, and p99
- Cost per user, per feature, per model
- Error rate and retry frequency
3. Evaluations — Quality scores attached to traces. You run LLM-as-judge scoring (relevance, faithfulness, helpfulness) or custom heuristic checks against sampled traces. This is what catches quality drift before users complain.
The mental model: logs tell you what happened. Traces tell you where it happened. Evaluations tell you how well it happened. You need all three.
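The evaluation pillar doesn't have to start with LLM-as-judge. A minimal sketch of a heuristic evaluator, scoring a completed trace with cheap checks, might look like the following. The trace dictionary shape and the specific checks here are illustrative, not from any particular tool's API:

```python
def evaluate_trace(trace: dict) -> dict[str, float]:
    """Score a completed trace on cheap heuristic checks."""
    scores = {}

    # Faithfulness proxy: did the answer reuse words from the retrieved context?
    context_words = set(" ".join(trace["retrieved_docs"]).lower().split())
    answer_words = set(trace["answer"].lower().split())
    overlap = len(answer_words & context_words) / max(len(answer_words), 1)
    scores["context_overlap"] = round(overlap, 2)

    # Length sanity: empty or whitespace-only answers are an instant fail.
    scores["non_empty"] = 1.0 if trace["answer"].strip() else 0.0

    return scores


trace = {
    "retrieved_docs": ["Pricing starts at $99/mo for the starter plan."],
    "answer": "The starter plan pricing starts at $99/mo.",
}
print(evaluate_trace(trace))
```

Heuristics like these run on every trace for free; reserve the expensive LLM-as-judge scoring for a sampled subset.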
Why Traditional Logging Falls Short
A print(response) statement captures the final output. But when your 5-step agent pipeline fails, you need to know which step failed and why. Did the retriever return irrelevant documents? Did the prompt template inject stale context? Did the LLM hallucinate despite receiving correct context? Without span-level visibility, debugging is guesswork.
Deep Dive: OpenTelemetry for LLM Applications
OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. In 2026, the gen_ai.* semantic conventions are stable, meaning you can instrument once and export to any backend — Langfuse, Datadog, Grafana, or your own storage.
Key Semantic Attributes
| Attribute | What It Captures |
|---|---|
| gen_ai.system | Provider name (openai, anthropic, google) |
| gen_ai.request.model | Model ID (gpt-4o, claude-sonnet-4-20250514) |
| gen_ai.usage.input_tokens | Tokens in the prompt |
| gen_ai.usage.output_tokens | Tokens in the completion |
| gen_ai.response.finish_reason | Why the model stopped (stop, length, tool_calls) |
Python: OpenTelemetry Instrumentation for LLM Calls
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import openai

# Set up OpenTelemetry
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-app")

def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    with tracer.start_as_current_span("llm.chat_completion") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)

        client = openai.OpenAI()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )

        # Capture token usage
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
        span.set_attribute("gen_ai.response.finish_reason", response.choices[0].finish_reason)

        return response.choices[0].message.content
```

This produces a span with all the data you need to debug, cost-track, and monitor each LLM call.
LLM Observability Architecture
Trace data flows from your LLM application through instrumentation and collection layers before reaching the analysis dashboards where you debug failures and monitor cost.
LLM Observability Tracing Pipeline
Section titled “LLM Observability Tracing Pipeline”📊 Visual Explanation
Section titled “📊 Visual Explanation”LLM Observability Tracing Pipeline
How trace data flows from your application to dashboards
The pipeline is the same regardless of which tool you pick. The difference is where the “Collection” and “Analysis” layers run — managed cloud (LangSmith) or your own infrastructure (Langfuse self-hosted).
LLM Observability Code Examples
These examples cover the three highest-value observability patterns: decorator-based tracing with Langfuse, the key metrics that matter in production, and the alert rules that catch 90% of issues.
Langfuse: Decorator-Based Tracing
Langfuse’s @observe() decorator is the fastest way to add tracing to existing Python code. No manual span management required:
```python
from langfuse.decorators import observe, langfuse_context
import openai

@observe()
def rag_pipeline(query: str) -> str:
    """Full RAG pipeline with automatic tracing."""
    embedding = embed_query(query)
    documents = search_vectors(embedding)
    answer = generate_answer(query, documents)
    return answer

@observe()
def embed_query(query: str) -> list[float]:
    client = openai.OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    return response.data[0].embedding

@observe()
def search_vectors(embedding: list[float]) -> list[str]:
    # Your vector DB search logic
    return ["doc_1: Pricing starts at $99/mo", "doc_2: Enterprise plan available"]

@observe()
def generate_answer(query: str, context: list[str]) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context: {context}"},
            {"role": "user", "content": query},
        ],
    )

    # Attach evaluation score to the current trace
    langfuse_context.score_current_trace(
        name="relevance",
        value=0.92,
        comment="Automated relevance check",
    )

    return response.choices[0].message.content
```

Each @observe() function becomes a span in the trace tree. The parent-child relationship is automatic — rag_pipeline is the root span, and embed_query, search_vectors, and generate_answer are child spans.
Key Metrics to Track in Production
These are the metrics that actually matter. Track them from day one:
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency p95 | Users abandon after 3-5 seconds | > 5s for chat, > 30s for batch |
| Token cost per request | Budget blowouts happen overnight | > 2x your average cost |
| Error rate | API failures, rate limits, timeouts | > 1% of requests |
| Quality score (LLM-as-judge) | Catches model degradation | Score drops > 10% week-over-week |
| Retrieval relevance | Bad retrieval = bad answers, always | < 70% relevance on sampled traces |
Production Alert Patterns
Set up these three alerts and you’ll catch 90% of production issues:
1. Cost spike alert — Token cost per hour exceeds 2x the 7-day average. Common cause: a prompt template change that doubled context length, or a retry loop burning tokens.
2. Quality degradation alert — Average evaluation score drops >10% over 24 hours. Common cause: model provider shipped an update (yes, this happens), or your RAG index went stale.
3. Latency outlier alert — p99 latency exceeds 10 seconds. Common cause: a specific user query triggers excessive tool calls in an agent loop, or a vector DB connection pool is exhausted.
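The cost-spike rule above reduces to a small comparison once you have hourly cost buckets. A sketch, with the data source left up to you (here the history is just a list of hourly dollar totals):

```python
def cost_spike(hourly_costs: list[float], multiplier: float = 2.0) -> bool:
    """True if the most recent hour cost more than `multiplier` times the
    average of the preceding window (up to 168 hourly buckets = 7 days)."""
    window = hourly_costs[-169:-1]  # the previous 168 hours
    if not window:
        return False  # not enough history to compare against
    baseline = sum(window) / len(window)
    return hourly_costs[-1] > multiplier * baseline


history = [1.0] * 168 + [2.5]  # steady $1/hour, then a $2.50 hour
print(cost_spike(history))
```

The quality and latency alerts follow the same shape: a rolling baseline, a threshold multiplier, and a comparison against the latest window.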
Trade-offs and Pitfalls
LangSmith offers zero-config setup for LangChain teams; Langfuse offers framework independence and self-hosting — the right choice depends on your stack and data sovereignty requirements.
LangSmith vs Langfuse — Which LLM Tracing Tool?
[Comparison: LangSmith vs Langfuse]

LangSmith:
- Zero-config with LangChain apps
- Hosted playground for prompt testing
- Built-in dataset management
- Vendor lock-in to LangChain ecosystem
- Pricing scales with trace volume

Langfuse:
- Framework-agnostic (works with any LLM library)
- Self-host for data sovereignty
- Cost tracking per trace/user/model
- Requires hosting infrastructure
- Smaller community than LangSmith
Decision Tree: Which Tool Should You Pick?
Ask these questions in order:
- Is your entire stack LangChain/LangGraph? Yes → LangSmith gives you zero-config tracing with one environment variable. Done.
- Do you need self-hosting for data sovereignty? Yes → Langfuse. Its open-source version self-hosts on any plan; LangSmith restricts self-hosted deployment to enterprise contracts.
- Are you using multiple frameworks (LangChain + LlamaIndex + custom)? Yes → Langfuse. It traces any framework equally well.
- Is cost tracking per user/feature critical? Yes → Langfuse has deeper cost analytics built in.
- Do you want the simplest managed setup? Yes → LangSmith. No infrastructure to manage.
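If the answer to the first question was LangSmith, the "one environment variable" setup is literally a couple of exports before you run your app. A sketch, with the key value as a placeholder:

```shell
# Enable LangSmith tracing for a LangChain app -- no code changes needed.
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="<your-langsmith-api-key>"
export LANGCHAIN_PROJECT="my-app-production"  # optional: named trace project

# Then run your app as usual; every chain, agent, and tool call is traced.
python app.py
```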
Pricing Reality
For a team of 5 engineers running 500k traces/month:
| | LangSmith | Langfuse Cloud | Langfuse Self-Hosted |
|---|---|---|---|
| Monthly cost | $200-500 (volume-based) | ~$120 (Pro plan) | $0 + infrastructure |
| Infrastructure | Managed | Managed | You run PostgreSQL + ClickHouse |
| Data location | LangChain’s cloud | Langfuse’s cloud | Your VPC |
Other Tools in the Space
LangSmith and Langfuse are the two leaders, but the ecosystem includes:
- Arize Phoenix — Open-source, strong on embedding drift detection
- Weights & Biases Weave — Good for teams already using W&B for ML experiment tracking
- Datadog LLM Observability — Best for teams already paying for Datadog APM
- Braintrust — Focused on evaluation-first observability
For a detailed feature-by-feature comparison of LangSmith and Langfuse specifically, see our LangSmith vs Langfuse deep-dive.
Interview Questions
These four questions test whether you understand why LLMs need different monitoring than deterministic software, and how to use traces systematically to debug production failures.
Q1: “What is LLM observability, and how does it differ from traditional application monitoring?”
What they’re testing: Do you understand why LLMs need different monitoring than deterministic software?
Strong answer: “Traditional monitoring tracks request/response success and latency. LLM observability adds three dimensions that traditional APM doesn’t cover: trace-level token and cost accounting, quality evaluation scoring on outputs, and prompt version tracking. The core difference is non-determinism — the same input can produce different outputs, so you need quality scoring to detect regressions, not just error rate monitoring.”
Weak answer: “You need to monitor AI differently because it uses more resources.”
Q2: “Your RAG app’s answer quality dropped 15% this week. Walk me through how you’d debug this with an observability tool.”
What they’re testing: Can you use traces to systematically diagnose an LLM pipeline issue?
Strong answer: “First, I’d filter traces by the quality evaluation score and look at the bottom 10% this week versus last week. Then I’d compare the trace trees. Specifically, I’d check the retriever spans — are they returning different documents? If retrieval quality dropped, the issue is in the vector index or embedding model. If retrieval looks fine but the generation span shows worse outputs, I’d check if the model version changed or if the system prompt was modified. Traces let you isolate the failing span instead of guessing.”
Q3: “How would you set up cost monitoring for an LLM application serving 100k requests/day?”
What they’re testing: Production cost awareness and proactive monitoring.
Strong answer: “I’d instrument every LLM call with token counts and model pricing. Langfuse does this automatically — each trace includes input tokens, output tokens, and computed cost. I’d build three views: cost per request (to spot expensive outliers), cost per user (to identify heavy users for rate limiting), and cost per feature (to find which pipelines are most expensive). Alerts fire when hourly cost exceeds 2x the rolling 7-day average.”
Q4: “Your team uses LangChain today but plans to add a custom retriever and a LlamaIndex component. How does this affect your observability tooling choice?”
What they’re testing: Architectural foresight and vendor lock-in awareness.
Strong answer: “This is exactly when Langfuse beats LangSmith. Langfuse is framework-agnostic — I can trace LangChain calls, custom Python functions, and LlamaIndex pipelines in the same trace tree using the @observe() decorator. LangSmith’s auto-tracing only works for LangChain. The custom retriever and LlamaIndex components would be blind spots unless I manually instrument them with the LangSmith SDK, which adds friction.”
Production Deployment Tips
Here’s what teams running LLM observability at scale actually care about:
Start with 100% tracing, sample for evaluation. Trace every request — the storage cost is low (a few cents per 10k traces on Langfuse Cloud). But run evaluations on a 1-5% sample. LLM-as-judge scoring is expensive (you’re calling another LLM per evaluation), so sample strategically: evaluate all traces with high token counts, all error traces, and a random sample of successful traces.
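That sampling strategy is a short decision function. A sketch, where the 1% rate and the token threshold are illustrative knobs, not recommendations from any tool:

```python
import random

def should_evaluate(trace: dict, sample_rate: float = 0.01,
                    token_threshold: int = 8_000) -> bool:
    """Decide whether to run (expensive) LLM-as-judge evaluation on a trace."""
    if trace.get("error"):
        return True  # evaluate every failed trace
    if trace.get("total_tokens", 0) > token_threshold:
        return True  # expensive traces are where quality issues hide
    return random.random() < sample_rate  # random slice of the rest


print(should_evaluate({"error": True, "total_tokens": 500}))
```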
Separate dev and production trace projects. Both LangSmith and Langfuse support projects/environments. Never mix dev traces with production data — dev traces are noisy, high-volume, and skew your dashboards. Use separate API keys per environment.
Connect observability to your deployment pipeline. The strongest pattern: run automated evaluations on a golden dataset before each deployment. If quality scores drop below a threshold, block the deployment. This turns observability from “debugging after the fact” into “preventing regressions before they ship.” Both platforms support this via their APIs.
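The gate itself is simple once the evaluation scores exist. A minimal sketch, where the metric names and thresholds are hypothetical and the scores are assumed to come from your evaluation run (e.g. via the LangSmith or Langfuse APIs):

```python
# Hypothetical quality bars for a golden-dataset evaluation run.
THRESHOLDS = {"relevance": 0.80, "faithfulness": 0.85}

def gate_deployment(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ok, failures) comparing eval scores against thresholds."""
    failures = [
        f"{metric}: {scores.get(metric, 0.0):.2f} < {minimum:.2f}"
        for metric, minimum in THRESHOLDS.items()
        if scores.get(metric, 0.0) < minimum
    ]
    return (not failures, failures)


ok, failures = gate_deployment({"relevance": 0.91, "faithfulness": 0.78})
print(ok, failures)
```

In CI, a False result exits non-zero and the deploy stops.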
Export to your existing monitoring stack. Teams don’t want a separate dashboard for LLM monitoring. Export key metrics (cost, latency, error rate, quality scores) to Datadog, Grafana, or whatever your team already uses. Langfuse’s open-source backend makes this straightforward — query the PostgreSQL database directly. LangSmith offers API exports.
Track prompt versions. Every time you change a system prompt, tag the trace with the prompt version. When quality drops, you can instantly correlate it with a specific prompt change. Both platforms support metadata tagging, but LangSmith has a dedicated Prompt Hub that versions prompts as first-class objects.
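In practice this is one metadata dictionary attached at trace creation. A sketch with a generic shape, since the exact metadata API differs per platform; adapt the key names to whichever tool you use:

```python
# Bump this constant on every system-prompt change so quality dips can be
# correlated with a specific prompt version. Name and format are our own
# convention, not a platform requirement.
PROMPT_VERSION = "support-bot-v12"

def trace_metadata(user_id: str, session_id: str) -> dict[str, str]:
    """Metadata to attach to each trace at creation time."""
    return {
        "prompt_version": PROMPT_VERSION,
        "user_id": user_id,
        "session_id": session_id,
    }


print(trace_metadata("u_123", "s_456")["prompt_version"])
```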
For related system design patterns in production LLM apps, see our architecture guide.
Summary and Key Takeaways
- LLM observability adapts the three monitoring pillars — traces (request lifecycle), metrics (latency, cost, tokens), and evaluations (quality scoring) — for non-deterministic systems
- OpenTelemetry gen_ai.* semantic conventions are the vendor-neutral standard for LLM instrumentation in 2026
- LangSmith wins on zero-config setup — one environment variable and every LangChain call is traced automatically
- Langfuse wins on flexibility — framework-agnostic, self-hostable, open-source (MIT), and deeper cost analytics
- Track these metrics from day one: latency p95, token cost per request, error rate, and quality evaluation scores
- Set three alerts: cost spikes (2x average), quality degradation (>10% drop), and latency outliers (p99 > 10s)
- Start with 100% tracing, sample 1-5% for evaluations — traces are cheap, LLM-as-judge scoring is not
- Connect observability to your deployment pipeline to block regressions before they reach production
Related
- LangSmith vs Langfuse — Feature-by-feature comparison of the two leading tools
- LLM Evaluation — Connect traces to evaluation workflows for quality scoring
- GenAI System Design — Architecture patterns for production LLM applications
- RAG Architecture — The most common pipeline to observe and debug
- Pydantic AI — Type-safe agent framework with built-in observability hooks
Last updated: March 2026 | OpenTelemetry 1.30+ / LangSmith v2.x / Langfuse v2.x
Frequently Asked Questions
What is LLM observability?
LLM observability is the practice of tracing, debugging, and monitoring LLM applications in production. Unlike traditional logging which captures only input and output strings, LLM observability provides full traces of what happened between — retrieval steps, prompt assembly, model calls, tool use, and token consumption. It bridges the gap between 'works on my machine' and 'fails for 12% of users' with traces, metrics, and evaluations.
How do LangSmith and Langfuse compare for LLM observability?
LangSmith is LangChain's managed tracing platform — zero-config integration with LangChain, hosted evaluation, and prompt playground. Langfuse is open-source and self-hostable — full control over data, works with any framework, and can be deployed on your own infrastructure. For a detailed feature-by-feature breakdown, see our LangSmith vs Langfuse deep-dive.
Why do LLM apps need different monitoring than traditional apps?
LLM applications are non-deterministic — the same input can produce different outputs. Traditional metrics like uptime and error rate are insufficient. You need to track quality metrics (evaluation scores, hallucination rate), cost metrics (token usage per request, daily spend), and latency metrics (time to first token, retrieval time, total response time). Quality can degrade silently without errors, making continuous evaluation essential.
What should I trace in an LLM application?
Trace the full request lifecycle: user input, prompt assembly (system prompt + context + user query), retrieval results (what chunks were returned and their relevance scores), model call details (model name, token count, latency), tool calls and results, and the final output. Include metadata like user ID, session ID, and prompt version. This trace-level visibility is essential for debugging non-deterministic failures in production.
Why does LLM observability matter for production applications?
Teams that skip LLM observability spend 3-5x longer debugging issues because non-deterministic systems produce failures that traditional logging cannot explain. Without traces, a bug report of 'the AI gives wrong answers' is nearly impossible to diagnose. Observability turns that into 'the retriever returned 0 relevant chunks at step 3,' making root cause analysis fast and systematic.
What are the key metrics to track for LLM observability?
The five essential metrics are latency p95 (alert if over 5 seconds for chat), token cost per request (alert at 2x average), error rate (alert above 1% of requests), quality evaluation scores from LLM-as-judge (alert on 10% week-over-week drop), and retrieval relevance on sampled traces (alert below 70%). Tracking these from day one catches cost spikes, quality regressions, and latency outliers before users notice.
What is the difference between LLM tracing and traditional logging?
Traditional logging captures a prompt string and a response string. LLM tracing captures the full request lifecycle as a tree of spans — each step (embedding, retrieval, reranking, generation) is a separate span with its own inputs, outputs, latency, and token counts. When a 5-step agent pipeline fails, tracing shows you which step failed and why, while logging only shows the final output.
How does LLM observability work with LangChain?
LangSmith provides zero-config tracing for LangChain by setting one environment variable (LANGCHAIN_TRACING_V2=true). Every chain, agent, and tool call is captured automatically. Langfuse integrates with LangChain via a CallbackHandler — a one-line setup that traces all LangChain operations. Both produce the same trace tree of spans showing each pipeline step.
What is the cost of not monitoring LLM applications?
Without observability, teams face 3-5x longer debugging cycles, undetected cost spikes from prompt changes or retry loops burning tokens overnight, silent quality degradation where model updates cause regressions users notice before engineers do, and latency outliers from runaway agent loops or exhausted connection pools. A single undetected prompt template change can double your token costs within hours.
How do I get started with LLM observability?
Start by choosing a platform based on your stack. If you use LangChain exclusively, LangSmith gives you zero-config tracing with one environment variable. If you need self-hosting or use multiple frameworks, Langfuse works with any LLM library. Begin with 100% tracing (storage is cheap) and sample 1-5% of traces for LLM-as-judge evaluations. Set up three alerts — cost spikes, quality drops, and latency outliers — and you will catch 90% of production issues.