Agent Debugging & Error Handling — Tracing, Logging & Recovery (2026)
Production agents fail in ways that RAG pipelines never do. A retrieval system either returns documents or it doesn’t — the failure surface is narrow and well-understood. An agent can get stuck in a reasoning loop, misinterpret a tool error as valid data, exhaust its context window mid-task, or silently produce wrong results that look correct. Debugging these failures requires a completely different approach from standard application debugging.
This guide covers the full observability and error handling stack for production agents: execution tracing, structured logging, error classification, recovery patterns, and the techniques that prevent infinite loops before they reach production.
1. Why Agent Debugging Matters
Agent failures are non-deterministic and path-dependent — debugging them requires trace-level observability, not log-level debugging.
Why Standard Debugging Does Not Work for Agents
When a conventional web service returns a 500 error, you look at the stack trace, find the failing line, and fix it. The failure is deterministic — the same inputs produce the same error every time. You can reproduce it locally and write a unit test that pins the fix.
Agent failures work differently. An agent that misclassifies a tool error, calls the wrong tool because of an ambiguous description, or gets stuck in a reasoning loop may succeed nine times and fail on the tenth — with no code change between the two runs. The failure depends on the LLM’s non-deterministic output, the specific wording of the user’s request, the sequence of tool results, and the accumulated context at each step.
This creates a debugging gap. Most engineers building their first agents try to debug them the same way they debug APIs: add a few print statements, check the final output, look for obvious errors. This works for simple cases. It fails completely for any agent running more than three or four steps, using more than two tools, or operating across multiple sessions.
The correct mental model is this: agent debugging is trace-level observability, not log-level debugging. You need to capture and inspect the full sequence of thoughts, actions, and observations — not just inputs and outputs.
What You Will Learn
This guide covers:
- Why agent debugging requires different tooling than standard application monitoring
- How to implement execution tracing with LangSmith and Langfuse
- The observability stack every production agent system needs
- How to classify and handle the three categories of agent errors
- Recovery patterns: retry, fallback, and checkpoint-based resumption
- Techniques for detecting and preventing infinite loops
- Production monitoring and alerting for agent systems
- What interviewers ask about agent reliability and how to answer
2. Why Agent Debugging Is Hard
Non-determinism at the LLM, tool result, and context accumulation layers compounds across steps, making agent failures nearly impossible to reproduce without a full execution trace.
Non-Determinism at Every Layer
Agents introduce non-determinism at three distinct layers, and each layer compounds the previous one.
Layer 1: LLM non-determinism. With temperature > 0, the same prompt generates different outputs on different runs. Even with temperature = 0, model updates, tokenization changes, or context window differences can change behavior. An agent that worked perfectly in development may behave differently in production after a model version update.
Layer 2: Tool result variability. Tools call external APIs, databases, and services. Those services have their own state, rate limits, and failure modes. A web search tool returns different results depending on when you call it. A database query returns different results as data changes. The agent’s reasoning path depends on these results, so variability in tools produces variability in agent behavior even if the LLM is deterministic.
Layer 3: Context accumulation. Each iteration adds content to the context window. The LLM's reasoning at step 7 depends on everything that happened in steps 1 through 6. A small deviation in step 3 — a slightly different tool result, a subtly different phrasing of a thought — can cascade into a completely different decision at step 7. This path dependence makes failures extremely hard to reproduce.
The Reproduction Problem
In standard software, reproducing a bug means running the same code with the same inputs. In agent systems, reproducing a failure means replaying the exact sequence of LLM outputs and tool results, in the exact order, in the exact context state they occurred. Without a trace recording system, this is impossible.
This is why execution tracing is not optional for production agents — it is the prerequisite for any debugging at all.
State Corruption Is Silent
Unlike a stack overflow or a null pointer exception, agent state corruption produces no error. The agent continues executing. It may even produce a response that looks plausible. But the internal state — the accumulated context, the task progress, the memory — is wrong, and every subsequent step builds on that corrupted foundation.
Examples of silent state corruption:
- A tool returns malformed JSON; the agent parses it as empty rather than an error, silently losing data
- A summarization step loses a critical fact; subsequent reasoning proceeds without it
- A state update writes to the wrong key; the agent references the old value in later steps
Detecting these requires explicit validation at each state transition — not just checking for exceptions, but asserting that the state is what you expect it to be.
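For illustration, here is a minimal sketch of transition-level validation with an assumed `AgentState` shape. The field names and invariants are hypothetical; a real agent would assert invariants against its own state schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    # Illustrative state shape — adapt to your agent's actual schema
    goal: str
    collected_facts: list = field(default_factory=list)
    step: int = 0

def validate_state(before: AgentState, after: AgentState) -> list[str]:
    """Assert invariants across one state transition; return any violations."""
    problems = []
    if after.goal != before.goal:
        problems.append("goal mutated mid-task")
    if after.step != before.step + 1:
        problems.append(f"step jumped from {before.step} to {after.step}")
    if len(after.collected_facts) < len(before.collected_facts):
        problems.append("facts were dropped during a state update")
    return problems

before = AgentState(goal="summarize Q3 report", collected_facts=["rev=$1M"])
after = AgentState(goal="summarize Q3 report", collected_facts=[], step=1)
print(validate_state(before, after))  # ['facts were dropped during a state update']
```

The point is that the summarization bug above produces no exception; only the explicit invariant check surfaces it.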
3. Execution Tracing
A complete execution trace records every LLM call, tool invocation, state snapshot, and timing for each iteration — without it, reproducing any agent failure is nearly impossible.
What Execution Tracing Captures
A complete execution trace for one agent invocation includes:
- Run metadata: unique run ID, session ID, user ID, start/end timestamp, total tokens, total cost, model version
- Each LLM call: full prompt (system + messages), raw output, token counts, latency, model parameters
- Each tool call: tool name, input parameters, raw output, execution time, success/failure status
- Reasoning steps: the agent’s explicit thoughts at each iteration
- State snapshots: the agent’s internal state at each checkpoint (LangGraph nodes, memory updates)
- Final output: the response delivered to the user, with evaluation metadata if available
This is significantly more data than standard application logs. A 10-step agent invocation with 5 tool calls generates perhaps 50,000–200,000 tokens of trace data. At scale — thousands of invocations per day — this requires a purpose-built trace storage and query system.
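As a rough mental model, one invocation's trace can be represented as a run record holding typed spans. The shape below is illustrative only; LangSmith and Langfuse each define their own trace schemas:

```python
from dataclasses import dataclass, field
import uuid, time

@dataclass
class Span:
    kind: str          # "llm_call" | "tool_call" | "state_snapshot"
    name: str
    input: dict
    output: dict
    latency_ms: int

@dataclass
class RunTrace:
    session_id: str
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    total_tokens: int = 0
    spans: list = field(default_factory=list)

    def add_span(self, span: Span, tokens: int = 0) -> None:
        self.spans.append(span)
        self.total_tokens += tokens

trace = RunTrace(session_id="sess-1")
trace.add_span(Span("llm_call", "plan", {"prompt": "..."}, {"text": "..."}, 420), tokens=900)
trace.add_span(Span("tool_call", "search", {"q": "..."}, {"hits": 3}, 130))
print(len(trace.spans), trace.total_tokens)  # 2 900
```

Even this toy record makes the storage math concrete: every span carries full inputs and outputs, which is why trace volume dwarfs ordinary log volume.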
LangSmith for LangGraph Agents
LangSmith is the native tracing solution for LangGraph. Setup requires minimal code change: set a few environment variables, and all LangGraph agent invocations are traced automatically.
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "production-agent"

# LangGraph agent code — no other changes needed
from langgraph.graph import StateGraph
# ... your agent definition
```

LangSmith captures the complete LangGraph execution tree, with each node as a separate span. You can inspect the full state at every node transition, replay any trace, and compare traces side-by-side. For multi-agent systems with supervisor patterns, LangSmith visualizes the entire invocation tree including all sub-agent calls.
Key LangSmith features for debugging:
- Trace timeline: visual view of execution sequence with timing per step
- State diff: shows exactly what changed in agent state at each node
- Token cost breakdown: per-step token usage and estimated cost
- Dataset creation: mark any trace as a test case for evaluation
- Human annotation: add correctness labels to traces for building evaluation datasets
For a detailed comparison of LangSmith vs Langfuse, see LangSmith vs Langfuse.
Langfuse for Framework-Agnostic Tracing
Langfuse provides tracing for any LLM framework via manual instrumentation or auto-instrumentation decorators. It is well-suited for custom agents or systems that combine multiple frameworks.
```python
from langfuse.decorators import observe, langfuse_context

@observe()
def run_agent(user_message: str, session_id: str):
    langfuse_context.update_current_trace(
        session_id=session_id,
        user_id="user-123",
        tags=["production", "customer-support"],
    )
    # Agent execution here
    return result

@observe(name="tool:search_knowledge_base")
def search_knowledge_base(query: str) -> str:
    # Tool implementation
    return results
```

The @observe decorator automatically captures function name, inputs, outputs, latency, and any exceptions. Langfuse stores traces in an open-source backend you can self-host, which is important for compliance use cases where traces must not leave your infrastructure.
Visual Explanation
Agent Execution Trace — What Gets Captured at Each Step
A complete trace records every decision point, tool call, and state transition. Without this, reproducing and diagnosing agent failures is nearly impossible.
4. Observability Stack
Production agent observability requires four layers — infrastructure metrics, execution traces, structured event logs, and semantic quality evaluation — each catching distinct failure modes.
The Four Layers of Agent Observability
Standard application monitoring tools — dashboards showing request rates, latency histograms, and error rates — are necessary but not sufficient for production agents. You need four layers of observability, and standard tooling covers only the first.
📊 Visual Explanation
Agent Observability Stack
Four layers of visibility, from infrastructure metrics up to semantic quality evaluation. Each layer catches different failure modes.
Layer 1 — Infrastructure & Cost Monitoring catches operational failures: the model API is down, you are hitting rate limits, latency is spiking, costs are running over budget. Standard APM tools handle this layer well.
Layer 2 — Execution Tracing catches behavioral failures: the agent took the wrong path, a tool returned unexpected data, reasoning diverged from expected patterns. LangSmith and Langfuse handle this layer.
Layer 3 — Structured Event Logging bridges infrastructure and behavior. Log discrete, parseable events for every significant agent decision: {event: "tool_error", tool: "search_api", error_type: "timeout", retry_count: 2, session_id: "..."}. These logs power alerting dashboards and let you answer operational questions like “how often does the search tool time out on Tuesdays?”
Layer 4 — Semantic Quality Evaluation catches output quality issues: the agent completed successfully by all technical measures but produced a wrong answer. This layer requires LLM-as-judge evaluation or human review pipelines. See LLMOps guide for detailed evaluation pipeline architecture.
What to Log at Each Tool Boundary
Section titled “What to Log at Each Tool Boundary”Every tool call should emit a structured log event with at minimum:
```python
import json, time, logging
from datetime import datetime, timezone

logger = logging.getLogger("agent.tools")

def instrumented_tool_call(tool_name: str, params: dict, session_id: str):
    start = time.monotonic()
    event = {
        "event": "tool_call_start",
        "tool": tool_name,
        "params": params,
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    logger.info(json.dumps(event))
    try:
        result = execute_tool(tool_name, params)
        duration_ms = int((time.monotonic() - start) * 1000)
        logger.info(json.dumps({
            "event": "tool_call_success",
            "tool": tool_name,
            "duration_ms": duration_ms,
            "result_length": len(str(result)),
            "session_id": session_id,
        }))
        return result
    except Exception as e:
        duration_ms = int((time.monotonic() - start) * 1000)
        logger.error(json.dumps({
            "event": "tool_call_error",
            "tool": tool_name,
            "error_type": type(e).__name__,
            "error_message": str(e),
            "duration_ms": duration_ms,
            "session_id": session_id,
        }))
        raise
```

Structured JSON logs are machine-parseable. They feed directly into dashboards, alerting rules, and anomaly detection pipelines. Unstructured text logs require regex parsing and break whenever the message format changes.
5. Error Classification
Agent errors fall into three categories — tool failures, LLM reasoning errors, and silent state corruption — each requiring a fundamentally different response strategy.
Three Categories of Agent Errors
Agent errors fall into three fundamentally different categories, each requiring a different response strategy.
Category 1: Tool Failures
The agent calls a tool, and the tool returns an error or unexpected output. This is the most common error category in production and the most tractable — tools are regular code, so most failures have known causes and predictable mitigations.
Subcategories:
- Transient failures: network timeouts, rate limit errors, temporary service unavailability — resolvable with retry
- Input validation errors: the LLM passed malformed parameters — resolvable with better tool descriptions and parameter validation
- Semantic errors: the tool succeeded technically but returned empty or irrelevant results — requires fallback strategies
- Partial failures: the tool returned some data but not all — requires explicit handling of incomplete results
Category 2: LLM Reasoning Errors
The LLM produces output that is syntactically valid but semantically wrong. These are harder to detect because they do not throw exceptions.
Subcategories:
- Hallucinated tool calls: the LLM invents a tool name that does not exist — caught by tool dispatch validation
- Parameter hallucination: valid tool name but nonsense parameters — caught by JSON Schema validation
- Reasoning drift: the LLM loses track of the original goal after several tool calls and pursues a different objective
- Premature termination: the LLM decides the task is complete when it is not
- Over-iteration: the LLM keeps calling tools when it already has enough information to answer
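The first two subcategories are mechanically detectable before dispatch. Below is a minimal sketch with a hypothetical tool registry; a production system would validate against each tool's full JSON Schema rather than a bare required-parameter set:

```python
# Hypothetical registry: tool name -> required parameter names
TOOL_REGISTRY = {
    "search_web": {"required": {"query"}},
    "search_knowledge_base": {"required": {"query", "top_k"}},
}

def validate_tool_call(tool_name: str, params: dict):
    """Return an error string the LLM can act on, or None if the call is valid."""
    if tool_name not in TOOL_REGISTRY:
        return (f"Unknown tool '{tool_name}'. "
                f"Available tools: {sorted(TOOL_REGISTRY)}")
    missing = TOOL_REGISTRY[tool_name]["required"] - params.keys()
    if missing:
        return f"Tool '{tool_name}' is missing required parameters: {sorted(missing)}"
    return None

print(validate_tool_call("serch_web", {"query": "x"}))   # hallucinated tool name: explicit error
print(validate_tool_call("search_web", {"query": "x"}))  # None (valid call)
```

Returning the error as a string the LLM can read, rather than raising, gives the model a chance to self-correct on the next iteration.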
Category 3: State Corruption
The agent’s internal state becomes inconsistent — a state update fails, a context summarization loses critical information, or a multi-agent message passes corrupted data.
State corruption is the most dangerous category because it is silent: the agent continues running on a corrupted foundation. Detection requires explicit state validation at each checkpoint.
6. Recovery Patterns
Three layered recovery patterns handle the full error spectrum: exponential backoff for transient failures, fallback strategies for persistent failures, and checkpoint-based resumption for long-running tasks.
Pattern 1: Retry with Exponential Backoff
Retry with exponential backoff is the correct default for transient tool failures. Do not simply retry immediately — this hammers a service that is already under pressure. Use exponential backoff with jitter.
```python
import asyncio, random

async def retry_with_backoff(
    tool_func,
    params: dict,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retryable_errors: tuple = (TimeoutError, ConnectionError),
):
    for attempt in range(max_retries + 1):
        try:
            return await tool_func(**params)
        except retryable_errors:
            if attempt == max_retries:
                raise
            delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
            await asyncio.sleep(delay)
        except Exception:
            raise  # Non-retryable errors fail immediately
```

Always distinguish retryable from non-retryable errors. A 429 rate limit error is retryable. A 400 bad request error (malformed parameters) is not — retrying will produce the same failure.
Pattern 2: Fallback Strategies
When a tool fails after retries, or when it returns empty/irrelevant results, provide the LLM with an alternative path rather than terminating with an error.
Effective fallback designs:
- Alternative tools: "If `search_web` fails, fall back to `search_knowledge_base`"
- Degraded responses: return a cached result with a staleness warning rather than nothing
- Explicit error messages: return a structured error the LLM can reason about — `{"error": "search_unavailable", "suggestion": "Use the knowledge_base_search tool instead"}`
- Human escalation: for high-stakes failures, surface a human-in-the-loop interrupt rather than guessing
The key design principle: never let tool errors silently return empty strings or None. When the LLM receives an empty response, it does not know whether the tool found nothing or failed. This ambiguity is a primary driver of infinite loops — the agent keeps retrying a tool that failed, mistaking silence for “no results found.”
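One way to enforce that principle is to wrap every tool invocation in an explicit result envelope, so that "failed" and "found nothing" are distinguishable observations. The envelope fields below are an illustrative convention, not a framework API:

```python
import json

def run_tool_safely(tool_fn, **params) -> str:
    """Return a JSON envelope that distinguishes failure from empty results."""
    try:
        result = tool_fn(**params)
    except Exception as e:
        return json.dumps({
            "status": "error",
            "error_type": type(e).__name__,
            "suggestion": "Try an alternative tool or answer from known context.",
        })
    if not result:
        return json.dumps({"status": "ok", "results": [],
                           "note": "Tool succeeded but found no matching results."})
    return json.dumps({"status": "ok", "results": result})

def flaky_search(query: str):
    raise TimeoutError("upstream search timed out")

print(run_tool_safely(flaky_search, query="q3 revenue"))
# {"status": "error", "error_type": "TimeoutError", ...}
```

Because the LLM sees `status` explicitly, it can choose a fallback tool on `"error"` instead of retrying the same call forever.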
Pattern 3: Checkpoint-Based Recovery
For long-running agents, implement checkpointing so the agent can resume from a known-good state after a failure rather than restarting from scratch.
LangGraph provides built-in checkpointing via persistence backends:
```python
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
from langchain_core.messages import HumanMessage

# Configure checkpointer
checkpointer = PostgresSaver.from_conn_string(
    "postgresql://user:password@localhost/agent_checkpoints"
)

# Attach to graph
graph = workflow.compile(checkpointer=checkpointer)

# Resume interrupted run
config = {"configurable": {"thread_id": "session-abc-123"}}
result = await graph.ainvoke(
    {"messages": [HumanMessage(content="Resume task")]},
    config=config
)
```

With checkpointing, if an agent fails at step 8 of a 15-step task, it resumes from step 7's checkpoint rather than starting over. For tasks that involve expensive tool calls (web scraping, database queries, LLM calls), this is not just convenient — it can save significant time and cost.
LangGraph’s persistence layer supports SQLite (development), PostgreSQL (production), and Redis (high-throughput). Store checkpoint data with a TTL appropriate for your use case — session-scoped checkpoints can expire after 24 hours; task-scoped checkpoints may need longer retention.
Visual Explanation
Error Recovery Decision Tree
Three-layer recovery strategy: first retry transient errors, then fall back to alternatives, then checkpoint-resume or human escalation.
7. Preventing Infinite Loops
Infinite loops occur when an agent cannot make forward progress but does not recognize it is stuck — prevent them with hard iteration limits, signature-based loop detection, and cost circuit breakers.
Why Agents Loop
An infinite loop occurs when the agent cannot make forward progress but does not recognize that it is stuck. Common triggers:
- A tool returns an ambiguous response (empty, partial, or unintelligible) and the LLM interprets it as “try again with different parameters”
- The LLM’s reasoning oscillates between two approaches without committing to either
- A tool succeeds, but the result does not contain what the LLM expected, so it retries with the same parameters
- A multi-agent handoff fails silently, and the supervisor keeps re-delegating the same task
Iteration Limits
Iteration limits are the most important safeguard. Every agent must have a hard iteration limit that terminates execution and raises an explicit error, not a silent empty response.
```python
from langgraph.errors import GraphRecursionError

# LangGraph iteration limit
config = {
    "recursion_limit": 15,  # Maximum graph iterations
    "configurable": {"thread_id": session_id},
}

try:
    result = await graph.ainvoke(input_state, config=config)
except GraphRecursionError:
    logger.error(json.dumps({
        "event": "iteration_limit_exceeded",
        "session_id": session_id,
        "iteration_count": 15,
    }))
    return {"error": "Task complexity exceeded processing limit. Please simplify your request."}
```

For most production use cases, 10–20 iterations is sufficient. Tasks requiring more iterations are usually misdesigned — they should be broken into smaller subtasks. If you consistently see agents hitting iteration limits, that is a signal to redesign the task decomposition, not to raise the limit.
Loop Detection
Beyond iteration limits, implement active loop detection. If the agent calls the same tool with the same parameters twice in a row, it is stuck:
```python
from hashlib import md5
import json

class LoopDetector:
    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.call_history: list[str] = []

    def check(self, tool_name: str, params: dict) -> bool:
        """Returns True if a loop is detected."""
        call_signature = md5(
            json.dumps({"tool": tool_name, "params": params}, sort_keys=True).encode()
        ).hexdigest()
        self.call_history.append(call_signature)

        # Check last N calls for repeated signature
        recent = self.call_history[-self.threshold:]
        return len(set(recent)) == 1 and len(recent) == self.threshold

    def reset(self):
        self.call_history.clear()
```

When a loop is detected, do not simply terminate — provide the LLM with explicit guidance: "You have called search_web with the same query twice. The tool appears to have limited results for this query. Consider using a different query, a different tool, or answering based on what you already know."
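Wiring signature-based detection into the dispatch path might look like the following sketch. The function names are illustrative, and the placeholder string stands in for the real tool call:

```python
import json
from hashlib import md5

def call_signature(tool_name: str, params: dict) -> str:
    # Stable hash of tool name plus sorted parameters
    return md5(json.dumps({"tool": tool_name, "params": params},
                          sort_keys=True).encode()).hexdigest()

def dispatch_with_loop_guard(history: list, tool_name: str, params: dict) -> str:
    """Return either the tool result or explicit loop guidance for the LLM."""
    sig = call_signature(tool_name, params)
    if history and history[-1] == sig:
        return (f"You have called {tool_name} with the same arguments twice. "
                "Consider a different query, a different tool, or answering "
                "from what you already know.")
    history.append(sig)
    return f"<result of {tool_name}>"  # placeholder for the real tool call

history = []
print(dispatch_with_loop_guard(history, "search_web", {"q": "x"}))  # runs the tool
print(dispatch_with_loop_guard(history, "search_web", {"q": "x"}))  # loop guidance
```

The guidance string is returned as the tool observation, so the LLM sees it in context on its next reasoning step.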
Cost and Time Circuit Breakers
For production systems with hard cost or latency budgets:
```python
from dataclasses import dataclass, field
import time

@dataclass
class AgentBudget:
    max_tokens: int = 50_000
    max_cost_usd: float = 0.50
    max_duration_seconds: float = 120.0
    tokens_used: int = 0
    cost_usd: float = 0.0
    start_time: float = field(default_factory=time.monotonic)

    def check(self) -> tuple[bool, str]:
        if self.tokens_used >= self.max_tokens:
            return False, f"Token budget exceeded ({self.tokens_used} tokens)"
        if self.cost_usd >= self.max_cost_usd:
            return False, f"Cost budget exceeded (${self.cost_usd:.4f})"
        elapsed = time.monotonic() - self.start_time
        if elapsed >= self.max_duration_seconds:
            return False, f"Time budget exceeded ({elapsed:.1f}s)"
        return True, "ok"
```

Circuit breakers should fail gracefully — return a partial result with a clear explanation, not a silent empty response.
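A budget check like this slots into the agent's driver loop. The toy sketch below uses a time-and-step budget only, with a hypothetical `step_fn` callable standing in for one reasoning/tool iteration:

```python
import time

def run_agent_loop(step_fn, max_duration_s: float = 120.0, max_steps: int = 15):
    """Drive an agent loop under a budget; return a partial result on breach."""
    start = time.monotonic()
    results = []
    for step in range(max_steps):
        if time.monotonic() - start >= max_duration_s:
            return {"status": "partial",
                    "note": f"Time budget exceeded after {step} steps.",
                    "results": results}
        outcome = step_fn(step)  # one reasoning/tool iteration
        results.append(outcome)
        if outcome == "done":
            return {"status": "complete", "results": results}
    return {"status": "partial", "note": "Step budget exceeded.", "results": results}

# A toy step function that finishes on the third iteration
print(run_agent_loop(lambda s: "done" if s == 2 else f"step-{s}"))
# {'status': 'complete', 'results': ['step-0', 'step-1', 'done']}
```

On a budget breach the caller still receives the partial results gathered so far, which is the graceful-failure behavior described above.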
8. Production Monitoring
Agent-specific metrics — iteration count per invocation, loop detection trigger rate, and task completion rate — reveal failure modes that standard latency and error rate dashboards miss entirely.
Metrics That Matter for Agent Systems
Standard web service metrics (request rate, latency, error rate) apply to agents but miss the most important signals. Add these agent-specific metrics to your monitoring dashboard:
| Metric | What It Detects | Alert Threshold |
|---|---|---|
| Average iterations per invocation | Increasing = agent design degradation | > 2× baseline |
| P95 token cost per invocation | Cost spikes | > 3× median |
| Tool error rate by tool name | Specific tool degradation | > 5% for any tool |
| Iteration limit hit rate | Agents getting stuck | > 1% of invocations |
| Loop detection trigger rate | Reasoning oscillation | > 0.5% of invocations |
| Task completion rate | Overall agent reliability | < 90% is a P1 incident |
| Context window utilization | Approaching limits | > 80% warrants review |
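Several of these metrics fall directly out of structured tool-call events. Here is a sketch of computing per-tool error rate from a batch of log events, assuming the event shapes used in the logging example earlier:

```python
from collections import Counter

def tool_error_rates(events: list) -> dict:
    """Error rate per tool from tool_call_success / tool_call_error events."""
    totals, errors = Counter(), Counter()
    for e in events:
        if e["event"] == "tool_call_success":
            totals[e["tool"]] += 1
        elif e["event"] == "tool_call_error":
            totals[e["tool"]] += 1
            errors[e["tool"]] += 1
    return {tool: errors[tool] / totals[tool] for tool in totals}

events = [
    {"event": "tool_call_success", "tool": "search_api"},
    {"event": "tool_call_error", "tool": "search_api"},
    {"event": "tool_call_success", "tool": "db_query"},
    {"event": "tool_call_success", "tool": "db_query"},
]
rates = tool_error_rates(events)
print(rates)  # {'search_api': 0.5, 'db_query': 0.0}
```

In production this aggregation would run in your log pipeline over a rolling window, feeding the "> 5% for any tool" alert threshold above.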
Alerting Strategy
Agent alerting requires different thresholds and routing than standard services. Recommended structure:
P1 alerts (page on-call immediately):
- Task completion rate drops below 85%
- Cost per invocation spikes >5× in a 15-minute window
- Model API returns 5xx errors
P2 alerts (ticket + next-business-day investigation):
- Any tool’s error rate exceeds 10% in a rolling 1-hour window
- Iteration limit hit rate exceeds 2%
- Average iteration count increases >50% week-over-week
Weekly review (no immediate action):
- Token cost trends
- Tool usage distribution changes
- Loop detection trigger frequency
Connecting Traces to Support Tickets
When a user reports an agent failure, you need to retrieve the exact execution trace for their session. Implement a session ID that flows from the user-facing request through every layer:
```python
import uuid
from contextvars import ContextVar
from langfuse.decorators import langfuse_context

current_session_id: ContextVar[str] = ContextVar("session_id")

def start_agent_session(user_id: str) -> str:
    session_id = f"{user_id}-{uuid.uuid4().hex[:8]}"
    current_session_id.set(session_id)

    # Pass to tracing backend
    langfuse_context.update_current_trace(
        session_id=session_id,
        user_id=user_id,
    )
    return session_id
```

Show users their session ID in error messages: "If you need support, reference session ID: abc-12345." This makes it possible to retrieve the exact trace — with full thought/action/observation sequence — for any user-reported failure.
9. Agent Debugging Interview Questions
Agent reliability is a standard senior GenAI interview topic — interviewers test whether you have debugged non-deterministic systems at scale, not whether you know tool names.
What Interviewers Test in Agent Reliability Questions
Agent debugging and error handling is a standard senior GenAI engineering interview topic as of 2025–2026. Interviewers are not testing whether you know that LangSmith exists — they are testing whether you have thought through the operational reality of running agents at scale.
Question 1: “How would you debug a production agent that intermittently returns wrong answers?”
Strong answer structure:
- Start with observability: you need the full execution trace for failing invocations, not just final outputs
- Compare failing vs passing traces: look for the divergence point — where did the execution path differ?
- Classify the error: is this a tool failure, an LLM reasoning error, or state corruption?
- Reproduce in staging: use the trace to replay the exact input sequence
- Fix and add to evaluation dataset: the failing case becomes a regression test
Weak answer: “I’d add more logging.” (Too vague, misses the trace-level insight required.)
Question 2: “How do you prevent an agent from running forever in production?”
Strong answer covers three layers:
- Iteration limits: hard cap with explicit error handling, not silent termination
- Loop detection: signature-based detection for repeated tool calls
- Cost/time circuit breakers: budget-based limits as a backstop
Mention that you also design against loops architecturally: clear tool descriptions that signal completion, fallback tools that prevent dead ends, and explicit done/not-done signals from tools.
Question 3: “A tool your agent depends on starts returning 503 errors. Walk me through your incident response.”
Strong answer:
- Immediate: retry logic should absorb transient errors; if retries are failing, the tool’s circuit breaker trips and returns a structured error to the LLM
- Short-term: monitor the tool’s error rate dashboard; if it exceeds threshold, route to fallback tool
- User-facing: the agent degrades gracefully — partial answer with a note that one data source is unavailable
- Recovery: when the tool recovers, flush any cached degraded responses
Question 4: “How do you test agent error handling before it goes to production?”
Strong answer describes a layered testing strategy:
- Unit tests for each tool’s error handling (timeout, bad parameters, partial results)
- Integration tests that inject synthetic errors using mocks
- Evaluation datasets built from real production failures (captured via tracing)
- Chaos testing in staging: randomly inject tool errors at low rates to verify recovery paths
For deeper context on agent architecture, review the AI Agents guide, agentic patterns, and AI guardrails.
10. Summary and Key Takeaways
Reliable agent systems rest on four practices: trace everything by default, classify errors before handling them, enforce hard iteration limits, and build your evaluation dataset from real production failures.
The Observability Stack in One View
| Layer | Tools | What It Catches |
|---|---|---|
| Infrastructure | Datadog, CloudWatch | API outages, rate limits, cost spikes |
| Execution traces | LangSmith, Langfuse | Behavioral failures, reasoning errors |
| Structured logs | Your app + log aggregator | Tool error patterns, business events |
| Quality evaluation | LLM-as-judge, human review | Wrong answers, hallucinations |
Engineering Principles for Agent Reliability
- Trace everything by default — the marginal cost of tracing is trivial; the cost of not being able to debug a production failure is not
- Classify errors before handling them — transient failures need retry, permanent failures need fallback, state errors need validation
- Always set iteration limits — never deploy an agent without a hard cap on loop count
- Design tool errors to be informative — an error message the LLM can reason about is worth more than a retry
- Use checkpointing for tasks over 5 steps — resumption is cheaper than restart
- Build your evaluation dataset from production failures — every incident is a future regression test
- Monitor agent-specific metrics — iteration count, loop detection rate, and task completion rate tell you things latency and error rate cannot
Related Pages
- AI Agents Guide — ReAct loop, tool use, memory architecture, and multi-agent patterns
- Agentic Design Patterns — Sequential, parallel, and supervisor workflow patterns
- LangSmith vs Langfuse — Detailed comparison of the two leading agent observability tools
- LLMOps Guide — Evaluation pipelines, deployment patterns, and production operations
- AI Guardrails — Input/output validation, safety checks, and content filtering for agent systems
Last updated: March 2026. LangGraph, LangSmith, and Langfuse APIs evolve frequently — verify specific method signatures against current documentation before implementing.