
LangSmith vs Langfuse: LLM Observability Compared (2026)

LangSmith and Langfuse are the two leading LLM observability platforms, and they solve the same core problem differently. LangSmith is LangChain’s commercial platform with the deepest LangChain/LangGraph integration. Langfuse is an open-source alternative that works with any framework and lets you self-host. If you’re choosing between them, the deciding factors are vendor lock-in tolerance, self-hosting requirements, and which LLM framework you use.

Who this is for:

  • Junior engineers: You need to debug why your LLM app gives wrong answers and want to see what happened inside each call
  • Senior engineers: You’re evaluating observability platforms for a production GenAI stack and need to make a defensible recommendation

You ship an LLM-powered feature. Users report “it sometimes gives wrong answers.” Without observability, you’re flying blind.

Here’s what you can’t answer without an LLM tracing tool:

| Question | Why It Matters |
| --- | --- |
| Which prompt version caused the regression? | You changed the system prompt last Tuesday — was that the trigger? |
| How much are you spending per request? | Token costs vary 10-100x depending on model and prompt length |
| Where is latency coming from? | Is it the retrieval step, the LLM call, or post-processing? |
| What did the user actually send vs. what the model received? | Prompt templates can transform input in unexpected ways |
| Are your RAG retrievals returning relevant documents? | Bad retrieval = bad answers, regardless of model quality |

Both LangSmith and Langfuse answer these questions. The difference is how — and what trade-offs you accept.

Engineers who skip LLM observability spend 3-5x longer debugging production issues. Traces turn a “the AI is wrong” bug report into “the retriever returned 0 relevant chunks at step 3.”


Think of LLM observability like application performance monitoring (APM) — but for AI. Just as Datadog traces HTTP requests through microservices, LangSmith and Langfuse trace LLM calls through your agent pipeline.

The core concept in both platforms is a trace: a tree of operations (called spans) that represents one end-to-end execution. A trace for a RAG pipeline might look like:

Trace: "user asked about pricing"
├── Span: Embed query (OpenAI, 150 tokens, 120ms)
├── Span: Vector search (Pinecone, 5 results, 45ms)
├── Span: Rerank results (Cohere, 200ms)
└── Span: Generate answer (GPT-4o, 1,200 tokens, 1.8s)

Both platforms capture the same fundamental data: inputs, outputs, tokens, latency, cost, and metadata at each span. Where they diverge is in integration depth, pricing, and deployment model.
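As a mental model, a trace is just a tree of spans you can aggregate over. Here is a stdlib-only sketch of that structure; the `Span` dataclass and its field names are invented for illustration and belong to neither platform's SDK:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One operation in a trace: an LLM call, a retrieval step, etc."""
    name: str
    latency_ms: float
    tokens: int = 0
    cost_usd: float = 0.0
    children: list["Span"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this span plus all nested child spans."""
        return self.cost_usd + sum(c.total_cost() for c in self.children)

# The RAG trace from above, as data (costs are illustrative)
trace = Span("rag-pipeline", latency_ms=2165, children=[
    Span("embed-query", 120, tokens=150, cost_usd=0.000003),
    Span("vector-search", 45),
    Span("rerank", 200, cost_usd=0.0002),
    Span("generate-answer", 1800, tokens=1200, cost_usd=0.012),
])

print(f"total cost: ${trace.total_cost():.6f}")  # total cost: $0.012203
```

Everything both dashboards show you — per-trace cost, latency breakdowns, token counts — is an aggregation over exactly this kind of tree.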


LangSmith traces LangChain apps with a single environment variable; Langfuse requires explicit SDK initialization but works with any framework.

```shell
pip install langsmith
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-api-key
```

That’s it. If you’re using LangChain or LangGraph, tracing is automatic — every chain, agent, and tool call is captured with zero code changes. This is LangSmith’s biggest advantage: if you’re already in the LangChain ecosystem, observability is a single environment variable.

```shell
pip install langfuse
```

```python
from langfuse import Langfuse

# Initialize the client
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com",  # or your self-hosted URL
)

# Manual tracing
trace = langfuse.trace(name="rag-pipeline", user_id="user-123")
span = trace.span(name="embed-query", input={"query": "pricing info"})
# ... your code ...
span.end(output={"embedding_dim": 1536})
```

For LangChain users, Langfuse also offers a drop-in callback:

```python
from langfuse.callback import CallbackHandler

handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-...",
)

# Pass to any LangChain chain or agent
result = chain.invoke({"input": "..."}, config={"callbacks": [handler]})
```

Both platforms insert an SDK layer between your application and a storage backend, but differ in whether that backend can be self-hosted.

Observability Stack Layers (how LLM observability platforms sit in your architecture, top to bottom):

  • Dashboard: trace explorer, cost analytics, evaluations
  • Storage & Processing: trace ingestion, aggregation, indexing
  • SDK Layer: LangSmith SDK / Langfuse SDK — captures spans
  • Your Application: LLM calls, agent logic, RAG pipelines

LLM Observability Data Flow (how traces flow from your application to the observability dashboard):

  • Your App: LLM calls + agent logic (LangChain / LangGraph / custom); the SDK captures spans automatically
  • SDK Layer: collects trace data, serializes inputs/outputs/tokens, batches and sends async
  • Backend: storage + processing. LangSmith: managed cloud only; Langfuse: cloud or self-hosted
  • Dashboard: analysis + debugging. Trace explorer, cost analytics, evaluation scores, prompt versioning

LangSmith vs Langfuse

LangSmith: LangChain's commercial observability platform
  • Zero-config tracing for LangChain/LangGraph
  • Deepest integration with LangChain ecosystem
  • Built-in prompt hub and versioning
  • Online evaluation with human feedback
  • No self-hosting option
  • Vendor lock-in to LangChain ecosystem
  • Limited free tier (5k traces/month)

Langfuse: open-source, framework-agnostic observability
  • Works with any framework (LangChain, LlamaIndex, custom)
  • Self-hosting option — full data sovereignty
  • Generous free tier (50k observations/month)
  • Open-source (MIT license)
  • Manual instrumentation for non-LangChain code
  • Smaller team, slower feature velocity
  • Self-hosting requires infrastructure management

Verdict: Use LangSmith if you're all-in on LangChain and want zero-config tracing. Use Langfuse if you need self-hosting, multi-framework support, or want to avoid vendor lock-in.

Use case: choosing an LLM observability platform for production.

The debugging workflow and cost tracking experience are nearly identical in both platforms — the difference is how traces get captured in the first place.

Here’s a real debugging workflow. Your RAG app returns “I don’t know” when users ask about pricing — but the docs clearly cover pricing.

In LangSmith: Open the trace. Click the retriever span. You see the query embedding returned 5 chunks — none about pricing. The problem isn’t the LLM; it’s the retriever. Fix the embedding or chunk strategy.

In Langfuse: Same workflow. Open the trace tree. Expand the retrieval span. Check the output field — the retrieved documents are about “product features,” not “pricing.” Same diagnosis, same fix.

The debugging experience is nearly identical. The difference is how you got the traces there in the first place.

Both platforms track token usage and cost per trace. Here’s what the data looks like:

| Metric | How to Access (LangSmith) | How to Access (Langfuse) |
| --- | --- | --- |
| Cost per trace | Dashboard → Runs → Cost column | Dashboard → Traces → Cost column |
| Daily spend | Dashboard → Analytics → Cost over time | Dashboard → Analytics → Cost chart |
| Cost by model | Filter runs by model → aggregate | Filter traces by model → aggregate |
| Cost by user | Tag runs with user_id metadata | Set user_id on trace creation |
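Once traces carry a `user_id`, per-user cost reporting is a simple aggregation over exported trace data. A stdlib-only sketch — the record shape below is illustrative, not either platform's actual export schema:

```python
from collections import defaultdict

# Illustrative exported trace records; real exports come from each platform's API
traces = [
    {"user_id": "user-123", "cost_usd": 0.012},
    {"user_id": "user-123", "cost_usd": 0.008},
    {"user_id": "user-456", "cost_usd": 0.021},
]

def cost_by_user(records: list[dict]) -> dict[str, float]:
    """Sum trace costs per user for chargeback or abuse detection."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["user_id"]] += r["cost_usd"]
    return dict(totals)

print(cost_by_user(traces))
```

The same aggregation works for tokens or latency; the key is tagging the `user_id` at trace creation so it is queryable later.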

The most common regret is choosing LangSmith for its LangChain integration, then adding a non-LangChain component with no way to trace it.

| Feature | LangSmith | Langfuse |
| --- | --- | --- |
| Auto-tracing (LangChain) | Yes — zero code changes | Yes — via callback handler |
| Auto-tracing (other frameworks) | No | Partial (OpenAI, Anthropic SDKs) |
| Self-hosting | No | Yes (Docker, Kubernetes) |
| Open-source | No | Yes (MIT license) |
| Prompt management | Full hub with versioning | Basic prompt versioning |
| Evaluation | Online + offline evals | Online + offline evals |
| Dataset management | Built-in | Built-in |
| Free tier | 5k traces/month | 50k observations/month |
| SOC 2 compliance | Yes | Yes (cloud), self-manage (self-hosted) |
| Multi-framework | LangChain-first | Framework-agnostic |
  • LangSmith: Free developer tier (5k traces/month). Plus plan starts at $39/seat/month. Enterprise is custom pricing. Traces beyond the free tier cost extra.
  • Langfuse: Self-hosted is free forever. Cloud free tier gives 50k observations/month. Pro plan starts at ~$59/month with higher limits. No per-seat pricing.

For a team of 5 engineers with 100k traces/month, Langfuse self-hosted costs $0 (plus your infrastructure). LangSmith would run $200-400/month depending on trace volume.
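The LangSmith figure above can be sanity-checked with back-of-envelope arithmetic. In this sketch, the included trace volume and per-1k overage rate are assumptions for illustration — verify them against current pricing before relying on the numbers:

```python
seats = 5
seat_price = 39.0            # Plus plan $/seat/month (from the comparison above)
included_traces = 10_000     # ASSUMED included monthly volume; check current pricing
extra_per_1k = 0.50          # ASSUMED overage rate in $/1k traces; check current pricing
monthly_traces = 100_000

seat_cost = seats * seat_price
overage = max(0, monthly_traces - included_traces) / 1000 * extra_per_1k
print(seat_cost + overage)   # 240.0 under these assumptions — inside the $200-400 range
```

The point is less the exact number than the shape of the cost: per-seat pricing plus usage-based trace charges, versus Langfuse self-hosted where the only cost is your infrastructure.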


These questions test whether you understand the unique debugging challenges of non-deterministic LLM pipelines, not just platform feature lists.

Q1: “Why do you need LLM observability? Can’t you just log prompts and responses?”


What they’re testing: Do you understand the unique challenges of debugging non-deterministic systems?

Strong answer: “Logging captures inputs and outputs, but LLM apps have multi-step pipelines where the failure point isn’t obvious. Observability gives you a trace tree — you can see that the retriever returned irrelevant chunks, or the prompt template injected stale context, or token costs spiked because the system prompt grew. Without traces, debugging a 5-step agent is guesswork.”

Weak answer: “It’s good practice to monitor your AI.”

Q2: “How would you evaluate LLM output quality at scale?”


What they’re testing: Can you connect observability to evaluation workflows?

Strong answer: “Both LangSmith and Langfuse support evaluation scores attached to traces. You can run automated evals — like LLM-as-judge scoring relevance and faithfulness — and attach the scores to each trace. Then you filter by low scores to find systematic failure patterns. This is how you catch regressions before users report them.”
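The filter-by-low-score step from that answer can be sketched with plain Python over exported trace records; the record shape and score names are illustrative, not either platform's schema:

```python
# Illustrative traces with LLM-as-judge scores attached
traces = [
    {"id": "t1", "scores": {"relevance": 0.9, "faithfulness": 0.95}},
    {"id": "t2", "scores": {"relevance": 0.3, "faithfulness": 0.8}},
    {"id": "t3", "scores": {"relevance": 0.85, "faithfulness": 0.4}},
]

def low_scoring(records: list[dict], metric: str, threshold: float = 0.7) -> list[str]:
    """Return IDs of traces whose score on `metric` falls below the threshold."""
    return [r["id"] for r in records if r["scores"].get(metric, 1.0) < threshold]

print(low_scoring(traces, "relevance"))     # flags t2: likely a retrieval problem
print(low_scoring(traces, "faithfulness"))  # flags t3: likely a hallucination
```

In practice both platforms do this filtering in their UIs; the value of scripting it is wiring low-score traces into alerts or regression test sets.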

Q3: “Your team uses LangChain but is considering adding LlamaIndex for a specific feature. How does this affect your observability choice?”


What they’re testing: Architectural foresight — can you anticipate integration challenges?

Strong answer: “This is exactly where Langfuse shines over LangSmith. Langfuse traces both LangChain and LlamaIndex natively. With LangSmith, the LlamaIndex portion would be a blind spot unless you manually instrument it with the LangSmith SDK.”


At production scale, here’s what actually matters:

Data sovereignty: If you’re in a regulated industry (healthcare, finance), self-hosting Langfuse on your own infrastructure means trace data never leaves your VPC. LangSmith’s managed-only model is a non-starter for some compliance requirements.

Trace volume costs: A busy production app can generate 1M+ traces/month. At that scale, Langfuse self-hosted saves thousands per month compared to any managed service. The trade-off is you’re responsible for database scaling (PostgreSQL + ClickHouse).

Integration with existing monitoring: Both platforms offer API exports. Teams typically pipe LLM traces into their existing observability stack (Datadog, Grafana) for unified dashboards. Langfuse’s open-source nature makes this easier — you own the database and can query it directly.

Evaluation workflows: The winning pattern is: (1) trace everything in production, (2) sample 1-5% of traces for automated evaluation, (3) flag low-scoring traces for human review. Both platforms support this, but LangSmith’s evaluation UI is more polished.
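The sampling step in that pattern is a one-liner. A deterministic variant hashes the trace ID, so a given trace is always in or out of the sample regardless of which worker sees it (illustrative sketch):

```python
import hashlib

def in_eval_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically sample ~rate of traces by hashing the trace ID."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (h % 10_000) < rate * 10_000

sampled = sum(in_eval_sample(f"trace-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10,000 traces")  # roughly 500 at a 5% rate
```

Hash-based sampling beats `random.random()` here because reruns and retries make the same in/out decision for the same trace, keeping evaluation sets stable.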


Tracing tells you what happened on a specific request. Monitoring tells you how your system is performing across all requests over time. Most teams deploy tracing first and add monitoring later — but the best production setups run both from day one.

LLM monitoring has four layers. Tracing (what LangSmith and Langfuse provide) is only the bottom layer.

LLM Monitoring Stack (tracing is the foundation; production systems need all four layers):

  • Alerting & SLOs: latency SLO breaches, cost anomalies, quality score drops (PagerDuty / Slack)
  • Quality Scoring: LLM-as-judge evals, RAGAS scores, user feedback signals (sampled or exhaustive)
  • Metrics & Aggregation: p50/p99 latency, error rate, tokens/day, cost/day (time-series dashboards)
  • Tracing: per-request spans — inputs, outputs, latency, tokens (LangSmith / Langfuse)
| Signal | Where to Track | Alert Threshold |
| --- | --- | --- |
| p99 latency | Datadog / Grafana (from trace exports) | >5s for interactive, >30s for async |
| Error rate | Application metrics + trace error spans | >1% of requests |
| Daily token spend | LangSmith/Langfuse cost dashboard | >150% of 7-day average |
| Hallucination rate | LLM-as-judge sampling (1-5% of traces) | >10% of sampled responses |
| No-answer rate | Application logs + trace metadata | Sudden spike (>2x baseline) |
| User feedback | Thumbs up/down in your UI → trace metadata | Satisfaction drop below 80% |
| Retrieval relevance | RAGAS context precision on sampled traces | Score drop below 0.7 |
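The daily-spend threshold from the table above translates directly into code. A minimal sketch with illustrative spend values:

```python
def cost_anomaly(today: float, last_7_days: list[float], factor: float = 1.5) -> bool:
    """Alert when today's spend exceeds factor x the trailing 7-day average."""
    baseline = sum(last_7_days) / len(last_7_days)
    return today > factor * baseline

history = [12.0, 11.5, 13.0, 12.5, 11.0, 12.0, 12.0]  # $/day, average 12.0
print(cost_anomaly(21.0, history))  # True: 21 > 1.5 * 12.0
print(cost_anomaly(14.0, history))  # False: 14 < 18.0
```

The same relative-to-baseline pattern works for hallucination rate and no-answer rate; absolute thresholds tend to go stale as traffic grows.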

Both LangSmith and Langfuse expose APIs to export trace data. The production pattern: pipe LLM traces into your existing observability stack for unified dashboards.

```python
# Example: export Langfuse traces to Prometheus metrics
from langfuse import Langfuse
from prometheus_client import Counter, Histogram

llm_latency = Histogram("llm_latency_seconds", "LLM call latency", ["model"])
llm_errors = Counter("llm_errors_total", "LLM call errors", ["model", "error_type"])

def track_trace(trace):
    """Called after each LLM interaction completes."""
    for span in trace.observations:
        # .lower() handles both "generation" and "GENERATION" type spellings
        if span.type.lower() == "generation":
            # end_time/start_time are datetimes; convert to seconds for the histogram
            duration = (span.end_time - span.start_time).total_seconds()
            llm_latency.labels(model=span.model).observe(duration)
            if span.status_message == "error":
                # assumes the span output is a dict that may carry an "error" key
                error_type = (
                    span.output.get("error", "unknown")
                    if isinstance(span.output, dict) else "unknown"
                )
                llm_errors.labels(model=span.model, error_type=error_type).inc()
```

The broader monitoring ecosystem includes tools like Helicone (proxy-based, zero-code), Portkey (gateway + observability), and Arize Phoenix (open-source evaluation). These complement rather than replace LangSmith/Langfuse — use them when you need gateway-level metrics or specialized evaluation workflows.


  • LangSmith is best when you’re 100% in the LangChain/LangGraph ecosystem and want zero-config tracing
  • Langfuse is best when you need self-hosting, multi-framework support, or a generous free tier
  • Both platforms capture the same core data: traces, spans, tokens, cost, latency, and evaluation scores
  • Self-hosting is Langfuse’s killer feature — free, full data sovereignty, MIT-licensed
  • Zero-config tracing is LangSmith’s killer feature — one env var and every LangChain call is traced
  • Don’t skip LLM observability — debugging production AI without traces is 3-5x slower than with them
  • Choose based on your stack’s future, not just today — multi-framework needs favor Langfuse

Last updated: February 2026 | LangSmith v2.x / Langfuse v2.x

Frequently Asked Questions

What is the difference between LangSmith and Langfuse?

LangSmith is LangChain's commercial observability platform — tightly integrated with LangChain and LangGraph, with managed hosting only. Langfuse is an open-source alternative that works with any LLM framework, supports self-hosting, and offers a generous free tier. LangSmith has deeper LangChain integration; Langfuse has broader framework support and data sovereignty options.

Is Langfuse really free?

Langfuse's self-hosted version is fully free and open-source (MIT license). The managed cloud version has a free tier with 50k observations/month. For teams needing more volume, paid plans start at around $59/month. LangSmith offers a free developer tier but requires paid plans for team features.

Can I use Langfuse with LangChain?

Yes. Langfuse has a native LangChain integration via the CallbackHandler. You can use Langfuse as your observability backend while still building with LangChain, LangGraph, or any other framework. The integration is a one-line setup.

Which is better for production: LangSmith or Langfuse?

It depends on your stack and requirements. Choose LangSmith if you're deeply invested in the LangChain ecosystem and want the tightest integration. Choose Langfuse if you need self-hosting, multi-framework support, or want to avoid vendor lock-in. Both are production-ready.

How does pricing compare between LangSmith and Langfuse?

LangSmith offers a free developer tier with 5k traces/month, with the Plus plan starting at $39/seat/month. Langfuse self-hosted is free forever (MIT license), and the managed cloud version has a free tier with 50k observations/month and a Pro plan at around $59/month with no per-seat pricing. For a team of 5 engineers with 100k traces/month, Langfuse self-hosted costs $0 plus infrastructure, while LangSmith runs $200-400/month.

Is LangSmith open-source?

No. LangSmith is a proprietary, managed-only platform from the LangChain team. There is no self-hosted option. Langfuse is the open-source alternative, released under the MIT license, which allows full self-hosting and data sovereignty. If open-source and self-hosting are requirements, Langfuse is the only option between the two.

What tracing capabilities do LangSmith and Langfuse offer?

Both platforms capture traces as trees of spans showing inputs, outputs, tokens, latency, and cost at each step. LangSmith provides zero-config auto-tracing for LangChain and LangGraph apps with a single environment variable. Langfuse requires explicit SDK initialization but supports auto-tracing for OpenAI, Anthropic, and other SDKs in addition to LangChain via its CallbackHandler.

How do evaluation features compare between LangSmith and Langfuse?

Both platforms support online and offline evaluations with scores attached to traces. LangSmith has a more polished evaluation UI with built-in dataset management and a dedicated Prompt Hub for versioning prompts. Langfuse supports LLM-as-judge scoring, custom evaluation functions, and user feedback signals. Both let you filter traces by evaluation scores to find systematic failure patterns.

Can I self-host Langfuse?

Yes. Langfuse can be self-hosted using Docker or Kubernetes with Helm charts, and ships one-click deploy options for Railway and Render. It requires PostgreSQL and ClickHouse, costing roughly $100-300/month on AWS or GCP for a production-grade setup. The real savings come at high trace volumes (1M+ traces/month) where managed pricing becomes expensive. Self-hosting gives you full data sovereignty — trace data never leaves your VPC.

Which should I choose for a startup vs enterprise?

For startups, Langfuse is often the better choice — the generous free tier and open-source self-hosting keep costs low while you scale. For enterprise teams already invested in LangChain, LangSmith's zero-config tracing reduces setup friction. However, enterprises in regulated industries often choose Langfuse self-hosted for data sovereignty since LangSmith's managed-only model can be a compliance blocker. See our LLM observability guide for the full monitoring strategy.