Agent Debugging & Error Handling — Tracing, Logging & Recovery (2026)
Production agents fail in ways that RAG pipelines never do. A retrieval system either returns documents or it doesn’t — the failure surface is narrow and well-understood. An agent can get stuck in a reasoning loop, misinterpret a tool error as valid data, exhaust its context window mid-task, or silently produce wrong results that look correct. Debugging these failures requires a completely different approach from standard application debugging.
This guide covers the full observability and error handling stack for production agents: execution tracing, structured logging, error classification, recovery patterns, and the techniques that prevent infinite loops before they reach production.
1. Why Agent Debugging Matters
Agent failures are non-deterministic and path-dependent — debugging them requires trace-level observability, not log-level debugging.
Why Standard Debugging Does Not Work for Agents
When a conventional web service returns a 500 error, you look at the stack trace, find the failing line, and fix it. The failure is deterministic — the same inputs produce the same error every time. You can reproduce it locally and write a unit test that pins the fix.
Agent failures work differently. An agent that misclassifies a tool error, calls the wrong tool because of an ambiguous description, or gets stuck in a reasoning loop may succeed nine times and fail on the tenth — with no code change between the two runs. The failure depends on the LLM’s non-deterministic output, the specific wording of the user’s request, the sequence of tool results, and the accumulated context at each step.
This creates a debugging gap. Most engineers building their first agents try to debug them the same way they debug APIs: add a few print statements, check the final output, look for obvious errors. This works for simple cases. It fails completely for any agent running more than three or four steps, using more than two tools, or operating across multiple sessions.
The correct mental model is this: agent debugging is trace-level observability, not log-level debugging. You need to capture and inspect the full sequence of thoughts, actions, and observations — not just inputs and outputs.
What You Will Learn
This guide covers:
- Why agent debugging requires different tooling than standard application monitoring
- How to implement execution tracing with LangSmith and Langfuse
- The observability stack every production agent system needs
- How to classify and handle the three categories of agent errors
- Recovery patterns: retry, fallback, and checkpoint-based resumption
- Techniques for detecting and preventing infinite loops
- Production monitoring and alerting for agent systems
- What interviewers ask about agent reliability and how to answer
2. Why Agent Debugging Is Hard
Non-determinism at the LLM, tool result, and context accumulation layers compounds across steps, making agent failures nearly impossible to reproduce without a full execution trace.
Non-Determinism at Every Layer
Agents introduce non-determinism at three distinct layers, and each layer compounds the previous one.
Layer 1: LLM non-determinism. With temperature > 0, the same prompt generates different outputs on different runs. Even with temperature = 0, model updates, tokenization changes, or context window differences can change behavior. An agent that worked perfectly in development may behave differently in production after a model version update.
Layer 2: Tool result variability. Tools call external APIs, databases, and services. Those services have their own state, rate limits, and failure modes. A web search tool returns different results depending on when you call it. A database query returns different results as data changes. The agent’s reasoning path depends on these results, so variability in tools produces variability in agent behavior even if the LLM is deterministic.
Layer 3: Context accumulation. Each iteration adds content to the context window. The LLM's reasoning at step 7 depends on everything that happened in steps 1 through 6. A small deviation in step 3 — a slightly different tool result, a subtly different phrasing of a thought — can cascade into a completely different decision at step 7. This path dependence makes failures extremely hard to reproduce.
The Reproduction Problem
In standard software, reproducing a bug means running the same code with the same inputs. In agent systems, reproducing a failure means replaying the exact sequence of LLM outputs and tool results, in the exact order, in the exact context state they occurred. Without a trace recording system, this is impossible.
This is why execution tracing is not optional for production agents — it is the prerequisite for any debugging at all.
State Corruption Is Silent
Unlike a stack overflow or a null pointer exception, agent state corruption produces no error. The agent continues executing. It may even produce a response that looks plausible. But the internal state — the accumulated context, the task progress, the memory — is wrong, and every subsequent step builds on that corrupted foundation.
Examples of silent state corruption:
- A tool returns malformed JSON; the agent parses it as empty rather than an error, silently losing data
- A summarization step loses a critical fact; subsequent reasoning proceeds without it
- A state update writes to the wrong key; the agent references the old value in later steps
Detecting these requires explicit validation at each state transition — not just checking for exceptions, but asserting that the state is what you expect it to be.
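For illustration, here is a minimal sketch of transition-level validation with an assumed `AgentState` shape. The field names and invariants are hypothetical; a real agent would assert invariants against its own state schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    # Illustrative state shape — adapt to your agent's actual schema
    goal: str
    collected_facts: list = field(default_factory=list)
    step: int = 0

def validate_state(before: AgentState, after: AgentState) -> list[str]:
    """Assert invariants across one state transition; return any violations."""
    problems = []
    if after.goal != before.goal:
        problems.append("goal mutated mid-task")
    if after.step != before.step + 1:
        problems.append(f"step jumped from {before.step} to {after.step}")
    if len(after.collected_facts) < len(before.collected_facts):
        problems.append("facts were dropped during a state update")
    return problems

before = AgentState(goal="summarize Q3 report", collected_facts=["rev=$1M"])
after = AgentState(goal="summarize Q3 report", collected_facts=[], step=1)
print(validate_state(before, after))  # ['facts were dropped during a state update']
```

The point is that the summarization bug above produces no exception; only the explicit invariant check surfaces it.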
3. Execution Tracing
A complete execution trace records every LLM call, tool invocation, state snapshot, and timing for each iteration — without it, reproducing any agent failure is nearly impossible.
What Execution Tracing Captures
A complete execution trace for one agent invocation includes:
- Run metadata: unique run ID, session ID, user ID, start/end timestamp, total tokens, total cost, model version
- Each LLM call: full prompt (system + messages), raw output, token counts, latency, model parameters
- Each tool call: tool name, input parameters, raw output, execution time, success/failure status
- Reasoning steps: the agent’s explicit thoughts at each iteration
- State snapshots: the agent’s internal state at each checkpoint (LangGraph nodes, memory updates)
- Final output: the response delivered to the user, with evaluation metadata if available
This is significantly more data than standard application logs. A 10-step agent invocation with 5 tool calls generates perhaps 50,000–200,000 tokens of trace data. At scale — thousands of invocations per day — this requires a purpose-built trace storage and query system.
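As a rough mental model, one invocation's trace can be represented as a run record holding typed spans. The shape below is illustrative only; LangSmith and Langfuse each define their own trace schemas:

```python
from dataclasses import dataclass, field
import uuid, time

@dataclass
class Span:
    kind: str          # "llm_call" | "tool_call" | "state_snapshot"
    name: str
    input: dict
    output: dict
    latency_ms: int

@dataclass
class RunTrace:
    session_id: str
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    total_tokens: int = 0
    spans: list = field(default_factory=list)

    def add_span(self, span: Span, tokens: int = 0) -> None:
        self.spans.append(span)
        self.total_tokens += tokens

trace = RunTrace(session_id="sess-1")
trace.add_span(Span("llm_call", "plan", {"prompt": "..."}, {"text": "..."}, 420), tokens=900)
trace.add_span(Span("tool_call", "search", {"q": "..."}, {"hits": 3}, 130))
print(len(trace.spans), trace.total_tokens)  # 2 900
```

Even this toy record makes the storage math concrete: every span carries full inputs and outputs, which is why trace volume dwarfs ordinary log volume.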
LangSmith for LangGraph Agents
LangSmith is the native tracing solution for LangGraph. Setup requires minimal code change: set a few environment variables, and all LangGraph agent invocations are traced automatically.
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "production-agent"

# LangGraph agent code — no other changes needed
from langgraph.graph import StateGraph
# ... your agent definition
```

LangSmith captures the complete LangGraph execution tree, with each node as a separate span. You can inspect the full state at every node transition, replay any trace, and compare traces side-by-side. For multi-agent systems with supervisor patterns, LangSmith visualizes the entire invocation tree including all sub-agent calls.
Key LangSmith features for debugging:
- Trace timeline: visual view of execution sequence with timing per step
- State diff: shows exactly what changed in agent state at each node
- Token cost breakdown: per-step token usage and estimated cost
- Dataset creation: mark any trace as a test case for evaluation
- Human annotation: add correctness labels to traces for building evaluation datasets
For a detailed comparison of LangSmith vs Langfuse, see LangSmith vs Langfuse.
Langfuse for Framework-Agnostic Tracing
Langfuse provides tracing for any LLM framework via manual instrumentation or auto-instrumentation decorators. It is well-suited for custom agents or systems that combine multiple frameworks.
```python
from langfuse.decorators import observe, langfuse_context

@observe()
def run_agent(user_message: str, session_id: str):
    langfuse_context.update_current_trace(
        session_id=session_id,
        user_id="user-123",
        tags=["production", "customer-support"],
    )
    # Agent execution here
    return result

@observe(name="tool:search_knowledge_base")
def search_knowledge_base(query: str) -> str:
    # Tool implementation
    return results
```

The @observe decorator automatically captures function name, inputs, outputs, latency, and any exceptions. Langfuse stores traces in an open-source backend you can self-host, which is important for compliance use cases where traces must not leave your infrastructure.
Visual Explanation
Agent Execution Trace — What Gets Captured at Each Step
A complete trace records every decision point, tool call, and state transition. Without this, reproducing and diagnosing agent failures is nearly impossible.
4. Observability Stack
Production agent observability requires four layers — infrastructure metrics, execution traces, structured event logs, and semantic quality evaluation — each catching distinct failure modes.
The Four Layers of Agent Observability
Standard application monitoring tools — dashboards showing request rates, latency histograms, and error rates — are necessary but not sufficient for production agents. You need four layers of observability, and standard tooling covers only the first.
📊 Visual Explanation
Agent Observability Stack
Four layers of visibility, from infrastructure metrics up to semantic quality evaluation. Each layer catches different failure modes.
Layer 1 — Infrastructure & Cost Monitoring catches operational failures: the model API is down, you are hitting rate limits, latency is spiking, costs are running over budget. Standard APM tools handle this layer well.
Layer 2 — Execution Tracing catches behavioral failures: the agent took the wrong path, a tool returned unexpected data, reasoning diverged from expected patterns. LangSmith and Langfuse handle this layer.
Layer 3 — Structured Event Logging bridges infrastructure and behavior. Log discrete, parseable events for every significant agent decision: {event: "tool_error", tool: "search_api", error_type: "timeout", retry_count: 2, session_id: "..."}. These logs power alerting dashboards and let you answer operational questions like “how often does the search tool time out on Tuesdays?”
Layer 4 — Semantic Quality Evaluation catches output quality issues: the agent completed successfully by all technical measures but produced a wrong answer. This layer requires LLM-as-judge evaluation or human review pipelines. See LLMOps guide for detailed evaluation pipeline architecture.
What to Log at Each Tool Boundary
Section titled “What to Log at Each Tool Boundary”Every tool call should emit a structured log event with at minimum:
```python
import json, time, logging
from datetime import datetime, timezone

logger = logging.getLogger("agent.tools")

def instrumented_tool_call(tool_name: str, params: dict, session_id: str):
    start = time.monotonic()
    event = {
        "event": "tool_call_start",
        "tool": tool_name,
        "params": params,
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    logger.info(json.dumps(event))
    try:
        result = execute_tool(tool_name, params)
        duration_ms = int((time.monotonic() - start) * 1000)
        logger.info(json.dumps({
            "event": "tool_call_success",
            "tool": tool_name,
            "duration_ms": duration_ms,
            "result_length": len(str(result)),
            "session_id": session_id,
        }))
        return result
    except Exception as e:
        duration_ms = int((time.monotonic() - start) * 1000)
        logger.error(json.dumps({
            "event": "tool_call_error",
            "tool": tool_name,
            "error_type": type(e).__name__,
            "error_message": str(e),
            "duration_ms": duration_ms,
            "session_id": session_id,
        }))
        raise
```

Structured JSON logs are machine-parseable. They feed directly into dashboards, alerting rules, and anomaly detection pipelines. Unstructured text logs require regex parsing and break whenever the message format changes.
5. Error Classification
Agent errors fall into three categories — tool failures, LLM reasoning errors, and silent state corruption — each requiring a fundamentally different response strategy.
Three Categories of Agent Errors
Agent errors fall into three fundamentally different categories, each requiring a different response strategy.
Category 1: Tool Failures
The agent calls a tool, and the tool returns an error or unexpected output. This is the most common error category in production and the most tractable — tools are regular code, so most failures have known causes and predictable mitigations.
Subcategories:
- Transient failures: network timeouts, rate limit errors, temporary service unavailability — resolvable with retry
- Input validation errors: the LLM passed malformed parameters — resolvable with better tool descriptions and parameter validation
- Semantic errors: the tool succeeded technically but returned empty or irrelevant results — requires fallback strategies
- Partial failures: the tool returned some data but not all — requires explicit handling of incomplete results
Category 2: LLM Reasoning Errors
The LLM produces output that is syntactically valid but semantically wrong. These are harder to detect because they do not throw exceptions.
Subcategories:
- Hallucinated tool calls: the LLM invents a tool name that does not exist — caught by tool dispatch validation
- Parameter hallucination: valid tool name but nonsense parameters — caught by JSON Schema validation
- Reasoning drift: the LLM loses track of the original goal after several tool calls and pursues a different objective
- Premature termination: the LLM decides the task is complete when it is not
- Over-iteration: the LLM keeps calling tools when it already has enough information to answer
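The first two subcategories are mechanically detectable before dispatch. Below is a minimal sketch with a hypothetical tool registry; a production system would validate against each tool's full JSON Schema rather than a bare required-parameter set:

```python
# Hypothetical registry: tool name -> required parameter names
TOOL_REGISTRY = {
    "search_web": {"required": {"query"}},
    "search_knowledge_base": {"required": {"query", "top_k"}},
}

def validate_tool_call(tool_name: str, params: dict):
    """Return an error string the LLM can act on, or None if the call is valid."""
    if tool_name not in TOOL_REGISTRY:
        return (f"Unknown tool '{tool_name}'. "
                f"Available tools: {sorted(TOOL_REGISTRY)}")
    missing = TOOL_REGISTRY[tool_name]["required"] - params.keys()
    if missing:
        return f"Tool '{tool_name}' is missing required parameters: {sorted(missing)}"
    return None

print(validate_tool_call("serch_web", {"query": "x"}))   # hallucinated tool name: explicit error
print(validate_tool_call("search_web", {"query": "x"}))  # None (valid call)
```

Returning the error as a string the LLM can read, rather than raising, gives the model a chance to self-correct on the next iteration.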
Category 3: State Corruption
The agent’s internal state becomes inconsistent — a state update fails, a context summarization loses critical information, or a multi-agent message passes corrupted data.
State corruption is the most dangerous category because it is silent: the agent continues running on a corrupted foundation. Detection requires explicit state validation at each checkpoint.
6. Recovery Patterns
Three layered recovery patterns handle the full error spectrum: exponential backoff for transient failures, fallback strategies for persistent failures, and checkpoint-based resumption for long-running tasks.
Pattern 1: Retry with Exponential Backoff
Retry with exponential backoff is the correct default for transient tool failures. Do not simply retry immediately — this hammers a service that is already under pressure. Use exponential backoff with jitter.
```python
import asyncio, random

async def retry_with_backoff(
    tool_func,
    params: dict,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retryable_errors: tuple = (TimeoutError, ConnectionError),
):
    for attempt in range(max_retries + 1):
        try:
            return await tool_func(**params)
        except retryable_errors:
            if attempt == max_retries:
                raise
            delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
            await asyncio.sleep(delay)
        except Exception:
            raise  # Non-retryable errors fail immediately
```

Always distinguish retryable from non-retryable errors. A 429 rate limit error is retryable. A 400 bad request error (malformed parameters) is not — retrying will produce the same failure.
Pattern 2: Fallback Strategies
When a tool fails after retries, or when it returns empty/irrelevant results, provide the LLM with an alternative path rather than terminating with an error.
Effective fallback designs:
- Alternative tools: "If `search_web` fails, fall back to `search_knowledge_base`"
- Degraded responses: return a cached result with a staleness warning rather than nothing
- Explicit error messages: return a structured error the LLM can reason about — `{"error": "search_unavailable", "suggestion": "Use the knowledge_base_search tool instead"}`
- Human escalation: for high-stakes failures, surface a human-in-the-loop interrupt rather than guessing
The key design principle: never let tool errors silently return empty strings or None. When the LLM receives an empty response, it does not know whether the tool found nothing or failed. This ambiguity is a primary driver of infinite loops — the agent keeps retrying a tool that failed, mistaking silence for “no results found.”
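One way to enforce that principle is to wrap every tool invocation in an explicit result envelope, so that "failed" and "found nothing" are distinguishable observations. The envelope fields below are an illustrative convention, not a framework API:

```python
import json

def run_tool_safely(tool_fn, **params) -> str:
    """Return a JSON envelope that distinguishes failure from empty results."""
    try:
        result = tool_fn(**params)
    except Exception as e:
        return json.dumps({
            "status": "error",
            "error_type": type(e).__name__,
            "suggestion": "Try an alternative tool or answer from known context.",
        })
    if not result:
        return json.dumps({"status": "ok", "results": [],
                           "note": "Tool succeeded but found no matching results."})
    return json.dumps({"status": "ok", "results": result})

def flaky_search(query: str):
    raise TimeoutError("upstream search timed out")

print(run_tool_safely(flaky_search, query="q3 revenue"))
# {"status": "error", "error_type": "TimeoutError", ...}
```

Because the LLM sees `status` explicitly, it can choose a fallback tool on `"error"` instead of retrying the same call forever.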
Pattern 3: Checkpoint-Based Recovery
For long-running agents, implement checkpointing so the agent can resume from a known-good state after a failure rather than restarting from scratch.
LangGraph provides built-in checkpointing via persistence backends:
```python
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
from langchain_core.messages import HumanMessage

# Configure checkpointer
checkpointer = PostgresSaver.from_conn_string(
    "postgresql://user:password@localhost/agent_checkpoints"
)

# Attach to graph
graph = workflow.compile(checkpointer=checkpointer)

# Resume interrupted run
config = {"configurable": {"thread_id": "session-abc-123"}}
result = await graph.ainvoke(
    {"messages": [HumanMessage(content="Resume task")]},
    config=config
)
```

With checkpointing, if an agent fails at step 8 of a 15-step task, it resumes from step 7's checkpoint rather than starting over. For tasks that involve expensive tool calls (web scraping, database queries, LLM calls), this is not just convenient — it can save significant time and cost.
LangGraph’s persistence layer supports SQLite (development), PostgreSQL (production), and Redis (high-throughput). Store checkpoint data with a TTL appropriate for your use case — session-scoped checkpoints can expire after 24 hours; task-scoped checkpoints may need longer retention.
Visual Explanation
Error Recovery Decision Tree
Three-layer recovery strategy: first retry transient errors, then fall back to alternatives, then checkpoint-resume or human escalation.
7. Preventing Infinite Loops
Infinite loops occur when an agent cannot make forward progress but does not recognize it is stuck — prevent them with hard iteration limits, signature-based loop detection, and cost circuit breakers.
Why Agents Loop
An infinite loop occurs when the agent cannot make forward progress but does not recognize that it is stuck. Common triggers:
- A tool returns an ambiguous response (empty, partial, or unintelligible) and the LLM interprets it as “try again with different parameters”
- The LLM’s reasoning oscillates between two approaches without committing to either
- A tool succeeds, but the result does not contain what the LLM expected, so it retries with the same parameters
- A multi-agent handoff fails silently, and the supervisor keeps re-delegating the same task
Iteration Limits
Iteration limits are the most important safeguard. Every agent must have a hard iteration limit that terminates execution and raises an explicit error, not a silent empty response.
```python
from langgraph.errors import GraphRecursionError

# LangGraph iteration limit
config = {
    "recursion_limit": 15,  # Maximum graph iterations
    "configurable": {"thread_id": session_id},
}

try:
    result = await graph.ainvoke(input_state, config=config)
except GraphRecursionError:
    logger.error(json.dumps({
        "event": "iteration_limit_exceeded",
        "session_id": session_id,
        "iteration_count": 15,
    }))
    return {"error": "Task complexity exceeded processing limit. Please simplify your request."}
```

For most production use cases, 10–20 iterations is sufficient. Tasks requiring more iterations are usually misdesigned — they should be broken into smaller subtasks. If you consistently see agents hitting iteration limits, that is a signal to redesign the task decomposition, not to raise the limit.
Loop Detection
Beyond iteration limits, implement active loop detection. If the agent calls the same tool with the same parameters twice in a row, it is stuck:
```python
from hashlib import md5
import json

class LoopDetector:
    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.call_history: list[str] = []

    def check(self, tool_name: str, params: dict) -> bool:
        """Returns True if a loop is detected."""
        call_signature = md5(
            json.dumps({"tool": tool_name, "params": params}, sort_keys=True).encode()
        ).hexdigest()
        self.call_history.append(call_signature)

        # Check last N calls for repeated signature
        recent = self.call_history[-self.threshold:]
        return len(set(recent)) == 1 and len(recent) == self.threshold

    def reset(self):
        self.call_history.clear()
```

When a loop is detected, do not simply terminate — provide the LLM with explicit guidance: "You have called search_web with the same query twice. The tool appears to have limited results for this query. Consider using a different query, a different tool, or answering based on what you already know."
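Wiring signature-based detection into the dispatch path might look like the following sketch. The function names are illustrative, and the placeholder string stands in for the real tool call:

```python
import json
from hashlib import md5

def call_signature(tool_name: str, params: dict) -> str:
    # Stable hash of tool name plus sorted parameters
    return md5(json.dumps({"tool": tool_name, "params": params},
                          sort_keys=True).encode()).hexdigest()

def dispatch_with_loop_guard(history: list, tool_name: str, params: dict) -> str:
    """Return either the tool result or explicit loop guidance for the LLM."""
    sig = call_signature(tool_name, params)
    if history and history[-1] == sig:
        return (f"You have called {tool_name} with the same arguments twice. "
                "Consider a different query, a different tool, or answering "
                "from what you already know.")
    history.append(sig)
    return f"<result of {tool_name}>"  # placeholder for the real tool call

history = []
print(dispatch_with_loop_guard(history, "search_web", {"q": "x"}))  # runs the tool
print(dispatch_with_loop_guard(history, "search_web", {"q": "x"}))  # loop guidance
```

The guidance string is returned as the tool observation, so the LLM sees it in context on its next reasoning step.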
Cost and Time Circuit Breakers
For production systems with hard cost or latency budgets:
```python
from dataclasses import dataclass, field
import time

@dataclass
class AgentBudget:
    max_tokens: int = 50_000
    max_cost_usd: float = 0.50
    max_duration_seconds: float = 120.0
    tokens_used: int = 0
    cost_usd: float = 0.0
    start_time: float = field(default_factory=time.monotonic)

    def check(self) -> tuple[bool, str]:
        if self.tokens_used >= self.max_tokens:
            return False, f"Token budget exceeded ({self.tokens_used} tokens)"
        if self.cost_usd >= self.max_cost_usd:
            return False, f"Cost budget exceeded (${self.cost_usd:.4f})"
        elapsed = time.monotonic() - self.start_time
        if elapsed >= self.max_duration_seconds:
            return False, f"Time budget exceeded ({elapsed:.1f}s)"
        return True, "ok"
```

Circuit breakers should fail gracefully — return a partial result with a clear explanation, not a silent empty response.
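A budget check like this slots into the agent's driver loop. The toy sketch below uses a time-and-step budget only, with a hypothetical `step_fn` callable standing in for one reasoning/tool iteration:

```python
import time

def run_agent_loop(step_fn, max_duration_s: float = 120.0, max_steps: int = 15):
    """Drive an agent loop under a budget; return a partial result on breach."""
    start = time.monotonic()
    results = []
    for step in range(max_steps):
        if time.monotonic() - start >= max_duration_s:
            return {"status": "partial",
                    "note": f"Time budget exceeded after {step} steps.",
                    "results": results}
        outcome = step_fn(step)  # one reasoning/tool iteration
        results.append(outcome)
        if outcome == "done":
            return {"status": "complete", "results": results}
    return {"status": "partial", "note": "Step budget exceeded.", "results": results}

# A toy step function that finishes on the third iteration
print(run_agent_loop(lambda s: "done" if s == 2 else f"step-{s}"))
# {'status': 'complete', 'results': ['step-0', 'step-1', 'done']}
```

On a budget breach the caller still receives the partial results gathered so far, which is the graceful-failure behavior described above.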
8. Production Monitoring
Agent-specific metrics — iteration count per invocation, loop detection trigger rate, and task completion rate — reveal failure modes that standard latency and error rate dashboards miss entirely.
Metrics That Matter for Agent Systems
Standard web service metrics (request rate, latency, error rate) apply to agents but miss the most important signals. Add these agent-specific metrics to your monitoring dashboard:
| Metric | What It Detects | Alert Threshold |
|---|---|---|
| Average iterations per invocation | Increasing = agent design degradation | > 2× baseline |
| P95 token cost per invocation | Cost spikes | > 3× median |
| Tool error rate by tool name | Specific tool degradation | > 5% for any tool |
| Iteration limit hit rate | Agents getting stuck | > 1% of invocations |
| Loop detection trigger rate | Reasoning oscillation | > 0.5% of invocations |
| Task completion rate | Overall agent reliability | < 90% is a P1 incident |
| Context window utilization | Approaching limits | > 80% warrants review |
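Several of these metrics fall directly out of structured tool-call events. Here is a sketch of computing per-tool error rate from a batch of log events, assuming the event shapes used in the logging example earlier:

```python
from collections import Counter

def tool_error_rates(events: list) -> dict:
    """Error rate per tool from tool_call_success / tool_call_error events."""
    totals, errors = Counter(), Counter()
    for e in events:
        if e["event"] == "tool_call_success":
            totals[e["tool"]] += 1
        elif e["event"] == "tool_call_error":
            totals[e["tool"]] += 1
            errors[e["tool"]] += 1
    return {tool: errors[tool] / totals[tool] for tool in totals}

events = [
    {"event": "tool_call_success", "tool": "search_api"},
    {"event": "tool_call_error", "tool": "search_api"},
    {"event": "tool_call_success", "tool": "db_query"},
    {"event": "tool_call_success", "tool": "db_query"},
]
rates = tool_error_rates(events)
print(rates)  # {'search_api': 0.5, 'db_query': 0.0}
```

In production this aggregation would run in your log pipeline over a rolling window, feeding the "> 5% for any tool" alert threshold above.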
Alerting Strategy
Agent alerting requires different thresholds and routing than standard services. Recommended structure:
P1 alerts (page on-call immediately):
- Task completion rate drops below 85%
- Cost per invocation spikes >5× in a 15-minute window
- Model API returns 5xx errors
P2 alerts (ticket + next-business-day investigation):
- Any tool’s error rate exceeds 10% in a rolling 1-hour window
- Iteration limit hit rate exceeds 2%
- Average iteration count increases >50% week-over-week
Weekly review (no immediate action):
- Token cost trends
- Tool usage distribution changes
- Loop detection trigger frequency
Connecting Traces to Support Tickets
When a user reports an agent failure, you need to retrieve the exact execution trace for their session. Implement a session ID that flows from the user-facing request through every layer:
```python
import uuid
from contextvars import ContextVar
from langfuse.decorators import langfuse_context

current_session_id: ContextVar[str] = ContextVar("session_id")

def start_agent_session(user_id: str) -> str:
    session_id = f"{user_id}-{uuid.uuid4().hex[:8]}"
    current_session_id.set(session_id)

    # Pass to tracing backend
    langfuse_context.update_current_trace(
        session_id=session_id,
        user_id=user_id,
    )
    return session_id
```

Show users their session ID in error messages: "If you need support, reference session ID: abc-12345." This makes it possible to retrieve the exact trace — with full thought/action/observation sequence — for any user-reported failure.
9. Agent Debugging Interview Questions
Agent reliability is a standard senior GenAI interview topic — interviewers test whether you have debugged non-deterministic systems at scale, not whether you know tool names.
What Interviewers Test in Agent Reliability Questions
Agent debugging and error handling is a standard senior GenAI engineering interview topic as of 2025–2026. Interviewers are not testing whether you know that LangSmith exists — they are testing whether you have thought through the operational reality of running agents at scale.
Question 1: “How would you debug a production agent that intermittently returns wrong answers?”
Strong answer structure:
- Start with observability: you need the full execution trace for failing invocations, not just final outputs
- Compare failing vs passing traces: look for the divergence point — where did the execution path differ?
- Classify the error: is this a tool failure, an LLM reasoning error, or state corruption?
- Reproduce in staging: use the trace to replay the exact input sequence
- Fix and add to evaluation dataset: the failing case becomes a regression test
Weak answer: “I’d add more logging.” (Too vague, misses the trace-level insight required.)
Question 2: “How do you prevent an agent from running forever in production?”
Strong answer covers three layers:
- Iteration limits: hard cap with explicit error handling, not silent termination
- Loop detection: signature-based detection for repeated tool calls
- Cost/time circuit breakers: budget-based limits as a backstop
Mention that you also design against loops architecturally: clear tool descriptions that signal completion, fallback tools that prevent dead ends, and explicit done/not-done signals from tools.
Question 3: “A tool your agent depends on starts returning 503 errors. Walk me through your incident response.”
Strong answer:
- Immediate: retry logic should absorb transient errors; if retries are failing, the tool’s circuit breaker trips and returns a structured error to the LLM
- Short-term: monitor the tool’s error rate dashboard; if it exceeds threshold, route to fallback tool
- User-facing: the agent degrades gracefully — partial answer with a note that one data source is unavailable
- Recovery: when the tool recovers, flush any cached degraded responses
Question 4: “How do you test agent error handling before it goes to production?”
Strong answer describes a layered testing strategy:
- Unit tests for each tool’s error handling (timeout, bad parameters, partial results)
- Integration tests that inject synthetic errors using mocks
- Evaluation datasets built from real production failures (captured via tracing)
- Chaos testing in staging: randomly inject tool errors at low rates to verify recovery paths
For deeper context on agent architecture, review the AI Agents guide, agentic patterns, and AI guardrails.
10. Summary and Key Takeaways
Reliable agent systems rest on four practices: trace everything by default, classify errors before handling them, enforce hard iteration limits, and build your evaluation dataset from real production failures.
The Observability Stack in One View
| Layer | Tools | What It Catches |
|---|---|---|
| Infrastructure | Datadog, CloudWatch | API outages, rate limits, cost spikes |
| Execution traces | LangSmith, Langfuse | Behavioral failures, reasoning errors |
| Structured logs | Your app + log aggregator | Tool error patterns, business events |
| Quality evaluation | LLM-as-judge, human review | Wrong answers, hallucinations |
Engineering Principles for Agent Reliability
- Trace everything by default — the marginal cost of tracing is trivial; the cost of not being able to debug a production failure is not
- Classify errors before handling them — transient failures need retry, permanent failures need fallback, state errors need validation
- Always set iteration limits — never deploy an agent without a hard cap on loop count
- Design tool errors to be informative — an error message the LLM can reason about is worth more than a retry
- Use checkpointing for tasks over 5 steps — resumption is cheaper than restart
- Build your evaluation dataset from production failures — every incident is a future regression test
- Monitor agent-specific metrics — iteration count, loop detection rate, and task completion rate tell you things latency and error rate cannot
Related Pages
- AI Agents Guide — ReAct loop, tool use, memory architecture, and multi-agent patterns
- Agentic Design Patterns — Sequential, parallel, and supervisor workflow patterns
- LangSmith vs Langfuse — Detailed comparison of the two leading agent observability tools
- LLMOps Guide — Evaluation pipelines, deployment patterns, and production operations
- AI Guardrails — Input/output validation, safety checks, and content filtering for agent systems
Last updated: March 2026. LangGraph, LangSmith, and Langfuse APIs evolve frequently — verify specific method signatures against current documentation before implementing.