Skip to content

Agentic AI Design Patterns — LLM Agent Architecture Guide

A single LLM call is stateless. Given a prompt, it produces a completion. That model is powerful, but insufficient for tasks that require multiple steps, external information, error recovery, or coordination across multiple independent workstreams.

The solution is agents: systems that give the LLM access to tools and allow it to reason across multiple steps before producing a final answer. But agents introduce a new class of engineering problems. How do you structure the reasoning loop? How do you handle tool failures? How do you prevent infinite loops? How do you scale from one agent to many?

Design patterns answer these questions. They are not academic abstractions. They are solutions to failure modes that have been encountered, diagnosed, and solved repeatedly across production GenAI systems. A team that knows these patterns builds more reliable agents faster. A team that does not reinvents them, badly, under deadline pressure.

In 2022, academic papers described techniques like ReAct and Chain-of-Thought, but there was no engineering-level pattern library for agent design. Teams built ad hoc reasoning loops, discovered the failure modes independently, and built ad hoc fixes. By 2024, recognizable patterns had emerged across the industry — LangGraph, AutoGen, CrewAI, and Claude Code all implement variants of the same core patterns. Understanding the patterns lets you reason about these systems independently of the framework implementing them.

Patterns apply whenever an agent needs to handle more than a single-turn request:

  • Multi-step tasks with dependencies between steps
  • Tasks requiring external data retrieval, calculation, or code execution
  • Tasks where output quality is uncertain and requires evaluation
  • Tasks that benefit from parallel execution across independent workstreams

For simple single-turn completions, patterns add overhead without value. The pattern selection decision should be driven by the complexity and reliability requirements of the task.


The simplest possible agent implementation: pass the user’s query to the LLM with a list of tools, and ask it to use the tools and answer. This works for demonstrations. In production, it fails in predictable ways.

Unbounded loops. Without a termination condition, an agent reasoning about a complex problem may loop indefinitely — calling tools, observing results, calling more tools, never converging. This burns tokens and produces no answer.

Silent tool failure. A tool returns an error. The agent sees the error message, decides to try a different tool, encounters a second error, and returns a confident-sounding answer based on no valid data. Nothing in the loop explicitly handles the failure state.

Context window exhaustion. A long reasoning chain accumulates observations from multiple tool calls. By step fifteen, the earliest observations have been pushed out of the context window. The agent makes decisions based on incomplete history.

Scope creep. Given a task like “analyze this codebase and suggest improvements,” an agent without explicit scope boundaries will read every file, call every available tool, and never deliver a focused answer. The blast radius of the task is undefined.

No quality feedback. An agent produces an answer. There is no mechanism to evaluate whether the answer is correct before returning it to the user.

Each of these failure modes has a corresponding design pattern that addresses it. Understanding the problem motivates the pattern.

The Cost of Getting This Wrong in Production

Section titled “The Cost of Getting This Wrong in Production”

A customer support agent that loops silently when it cannot find information in the knowledge base will either time out or return a hallucinated answer. A code analysis agent without scope control will read configuration files, lock files, and generated code — diluting the signal with noise. A research agent without a self-evaluation step will return a well-formatted summary of incorrect information.

These problems do not manifest in demos. They manifest when the system encounters the long tail of production inputs.


Seven design patterns cover the majority of production agent architectures. They are ordered by increasing complexity and decreasing frequency of use — the first few apply in almost every agent system, the later ones apply in specific contexts.

The foundation of all agent patterns. The LLM is given a schema of available tools — their names, descriptions, and typed input parameters. When the LLM determines that a tool is needed, it emits a structured tool call rather than free text. The framework executes the tool and returns the result to the LLM as an observation.

What makes this a pattern rather than just a feature is the discipline around tool design: tools should have clear names, unambiguous descriptions, and a single responsibility. A tool called search_documents that also updates a database is not a well-designed tool. The LLM’s ability to select the right tool depends on clear tool descriptions. Poorly described tools produce incorrect tool selection, which produces incorrect results.

Introduced in a 2022 paper, ReAct interleaves reasoning steps with action steps. Rather than asking the LLM to produce a final answer in one shot, the framework prompts it to emit a Thought (explicit reasoning about what to do), an Action (a specific tool call), and then observe the Observation (the tool’s output). This cycle repeats until the LLM emits a Final Answer.

The insight is that explicit reasoning steps improve tool selection and reduce hallucination. The LLM’s Thought step forces it to articulate why it is calling a particular tool before calling it, which catches many tool misuse errors before they happen.

ReAct is the default loop in LangGraph, LangChain’s AgentExecutor, and most agent implementations you will encounter in the wild. Understanding it is foundational.

ReAct is a reactive pattern — the agent decides what to do at each step based on the previous observation. For long-horizon tasks, this myopia is a problem: the agent loses sight of the overall goal when intermediate steps are complex.

Plan-and-Execute separates the agent into two components. The planner receives the task and produces a complete plan: an ordered list of steps, each with a clear description and success criterion. The executor works through the plan step by step, using tools as needed. The planner only runs once (or when replanning is triggered); the executor runs N times.

The advantage: the planner has full task context when creating the plan. The executor has a clear scope for each step. Neither component has to do both jobs simultaneously.

The trade-off: planning errors are expensive. If the planner misunderstands the task, every executor step is working toward the wrong goal. Some implementations add a replanning step — if the executor encounters an unrecoverable failure, it calls the planner again with updated context.

Reflexion adds a self-evaluation loop after the agent produces an output. Rather than returning the output immediately, a separate evaluation step assesses the output against explicit criteria — correctness, completeness, relevance, safety. If the output fails evaluation, the agent revises it and evaluates again, up to a configured maximum number of iterations.

The evaluator can be the same model used for generation (the LLM critiques its own output), a separate smaller model optimized for evaluation, or a deterministic function (a test runner, a syntax checker, a schema validator).

Reflexion is appropriate for tasks with measurable quality criteria. A code generation agent should run the generated code and evaluate whether the tests pass before returning it. A research synthesis agent should evaluate whether the synthesis actually answers the original question. For tasks with no measurable criteria, reflexion adds latency without improving quality.

Single agents have a bounded context window and a single thread of execution. Complex tasks benefit from parallelism and specialization.

Multi-agent delegation uses an orchestrator agent that decomposes a task and delegates sub-tasks to worker agents. Workers can execute in parallel when tasks are independent. Workers can be specialized — a research worker, a code generation worker, a validation worker, each with its own tool set and system prompt.

The orchestrator collects worker outputs and synthesizes a final result. The complexity is in the orchestration logic: how does the orchestrator decide when all workers are done? How does it handle partial failures? How does it pass context between workers who have dependencies?

LangGraph, AutoGen, and CrewAI each provide different models for this — LangGraph as an explicit state machine, AutoGen as conversational turn-taking, CrewAI as role-and-task delegation. The underlying multi-agent pattern is the same.

By default, LLM agents are stateless between sessions. Memory patterns add persistence:

  • Short-term memory (conversation buffer): The running message history of the current session. Limited by context window size. Managed automatically in most frameworks.
  • Sliding window memory: A fixed-length window of the most recent N messages. Prevents context window exhaustion on long conversations.
  • Summary memory: Periodically summarize the conversation so far and compress it into a shorter representation. Retains key facts while reducing token count.
  • Long-term memory (vector store): Embed past interactions, facts, or documents into a vector database. Retrieve relevant memories at query time using semantic similarity. Enables agents that “remember” facts across sessions.
  • Episodic memory: Store the outcomes of past agent tasks. When a similar task is requested in the future, retrieve the previous execution trace as a reference.

Memory pattern selection depends on the task: customer support agents typically use short-term conversation memory with long-term document retrieval; personal assistant agents use summary memory and episodic recall.

Guardrail patterns validate agent inputs and outputs against safety and quality criteria.

Input validation checks the user’s request before processing. Is the request within scope? Does it contain prompt injection attempts? Is it relevant to the agent’s domain?

Output validation checks the agent’s response before delivering it. Does it reference real information or hallucinated facts? Does it comply with content policies? Does it answer the original question?

Self-critique prompts the agent to evaluate its own output against a rubric before returning it. More flexible than deterministic validators but adds latency and token cost.

Production agent systems at scale use all three. Input validation prevents abuse; output validation prevents harm; self-critique catches quality failures that deterministic validators miss.


Building a production agent involves layering these patterns incrementally. The order matters: each layer adds complexity and should only be added when the simpler layer is proving insufficient.

Before building an agent, design your tools. For each external capability the agent needs, define a tool with:

  • A clear, action-oriented name (search_knowledge_base, not tool1)
  • A description written for the LLM, not for a human — explain when to use the tool, not just what it does
  • Typed input parameters with descriptions
  • A clear return format

A common mistake is building the agent first and the tools second. Tools that were designed for the agent are better than tools that were adapted from existing APIs.

The minimal viable agent is a ReAct loop: a system prompt that instructs the LLM to emit Thought/Action/Observation cycles, a loop that executes tool calls and appends observations to the context, and a termination condition that detects a Final Answer.

In LangGraph, this is a graph with one node per step type and a conditional edge that checks whether the LLM has finished. In LangChain, it is an AgentExecutor with a prompt template that encodes the ReAct format.

Before moving to more complex patterns, run this loop on representative inputs. Observe where it fails. Does it select wrong tools? Does it loop? Does it produce incorrect answers? The failure mode determines the next pattern to add.

Step 3: Add Plan-and-Execute for Long-Horizon Tasks

Section titled “Step 3: Add Plan-and-Execute for Long-Horizon Tasks”

If the ReAct loop performs well on simple tasks but struggles with multi-step tasks that span many tool calls, add Plan-and-Execute.

The planner is typically a separate LLM call at session start that receives the full task description and produces a structured plan. The plan is stored in state and passed to the executor at each step. The executor’s system prompt instructs it to complete only the current step in the plan, not the entire task.

This separation of concerns dramatically improves performance on complex tasks. It also makes the agent’s behavior more predictable and auditable — you can log the plan and inspect whether each execution step matches the planner’s intent.

Step 4: Add Reflexion for Quality-Critical Outputs

Section titled “Step 4: Add Reflexion for Quality-Critical Outputs”

If the agent produces outputs with measurable quality criteria — code that must run, answers that can be fact-checked, documents that must meet a format specification — add a reflexion loop after the initial generation.

A minimal reflexion implementation runs the output through a validation function and, if validation fails, appends the validation result as a new observation and asks the agent to revise. Cap the retry loop at three iterations to prevent infinite refinement.

More sophisticated implementations use a separate evaluator model that produces a structured evaluation: what was good, what was wrong, what should be changed. This structured feedback is more useful to the generator than a raw error message.

Step 5: Introduce Multi-Agent for Parallel Workloads

Section titled “Step 5: Introduce Multi-Agent for Parallel Workloads”

If the task naturally decomposes into independent sub-tasks that could run in parallel, and the single-agent loop is a bottleneck, introduce multi-agent delegation.

Start with the simplest possible decomposition: an orchestrator that generates a list of sub-tasks, a pool of identical worker agents that each execute one sub-task, and a synthesis step that combines results. Only introduce specialized workers when the evidence shows that different sub-tasks benefit from different tools or system prompts.

Multi-agent systems are significantly harder to debug than single-agent systems. Instrument every worker invocation with tracing (LangSmith or equivalent) before deploying to production.


The patterns are not mutually exclusive — production systems combine them. A research agent might use Plan-and-Execute as the outer loop, ReAct within each executor step, Reflexion before delivering the synthesis, and a multi-agent architecture for parallel source retrieval.

The diagram below maps the four primary structural patterns by their complexity and the task characteristics that motivate them.

Agentic Pattern Complexity Tiers

Patterns layered by task complexity and reliability requirements

Tool Use
Single-step
User Request
Tool Selection
Tool Execution
Return Result
ReAct Loop
Multi-step
Thought
Action (Tool Call)
Observation
Repeat or Answer
Plan + Execute
Long-horizon
Planner: Full Plan
Executor: Step 1
Executor: Step N
Synthesize Output
Multi-Agent
Parallel / Specialized
Orchestrator
Worker A
Worker B
Aggregate Results
Idle

Each tier builds on the previous: a multi-agent system typically uses ReAct within each worker and may use Plan-and-Execute at the orchestrator level. Reflexion can be applied at any tier as a quality gate before output is returned.

Tool Use alone is sufficient for retrieval-augmented QA, document classification, and single-step API calls. Adding a ReAct loop to a single-tool agent adds overhead without benefit.

ReAct is the right default for tasks that require 2–8 sequential steps with observable intermediate results. It handles the majority of production agent use cases.

Plan-and-Execute is appropriate when tasks have 10+ steps, when the steps have dependencies that are only visible from a full-task view, or when the ReAct loop is failing due to myopic step selection.

Multi-Agent is appropriate when sub-tasks are independent and can run in parallel, when specialized tool sets or system prompts improve quality on different sub-task types, or when a single agent’s context window is insufficient for the full task.


Example: ReAct Agent in Python with LangGraph

Section titled “Example: ReAct Agent in Python with LangGraph”

The following implements a minimal ReAct agent with two tools using LangGraph’s create_react_agent helper.

from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
@tool
def search_knowledge_base(query: str) -> str:
"""Search the internal knowledge base for information relevant to the query.
Use when the user asks about company policies, product details, or FAQs.
"""
# In production: call your vector store retrieval API
return f"Knowledge base result for: {query}"
@tool
def lookup_order_status(order_id: str) -> str:
"""Look up the current status of a customer order by order ID.
Use when the user provides an order number and asks about delivery status.
"""
# In production: call your order management system API
return f"Order {order_id}: Shipped, estimated delivery 2 days"
model = ChatAnthropic(model="claude-sonnet-4-5-20250929")
tools = [search_knowledge_base, lookup_order_status]
# create_react_agent handles the Thought/Action/Observation loop internally
agent = create_react_agent(model, tools)
result = agent.invoke({
"messages": [{"role": "user", "content": "Where is order #12345?"}]
})
print(result["messages"][-1].content)

Key points in this example: tool descriptions are written for the LLM (they explain when to use the tool, not just what it does), the agent loop is fully managed by create_react_agent, and the model receives only the final assembled messages — you do not manage Thought/Action/Observation formatting manually when using LangGraph.

Example: Plan-and-Execute with Explicit Plan State

Section titled “Example: Plan-and-Execute with Explicit Plan State”
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
model = ChatAnthropic(model="claude-sonnet-4-5-20250929")
def generate_plan(task: str, tools: list[str]) -> list[str]:
"""Planner: produce an ordered list of steps to complete the task."""
response = model.invoke([
SystemMessage(content=(
"You are a task planner. Given a task and available tools, "
"produce a numbered list of concrete steps to complete the task. "
f"Available tools: {', '.join(tools)}"
)),
HumanMessage(content=task),
])
# Parse the numbered list from the response
lines = response.content.strip().split('\n')
return [l.strip() for l in lines if l.strip() and l[0].isdigit()]
def execute_step(step: str, context: str, agent) -> str:
"""Executor: complete one step, given accumulated context."""
result = agent.invoke({
"messages": [{
"role": "user",
"content": f"Context so far:\n{context}\n\nComplete this step only: {step}"
}]
})
return result["messages"][-1].content
# Usage
task = "Analyze the quarterly sales data and produce a summary with key trends."
tools_available = ["read_csv_file", "calculate_statistics", "generate_chart"]
plan = generate_plan(task, tools_available)
context = ""
for step in plan:
result = execute_step(step, context, agent)
context += f"\n{step}: {result}"

This separates planning from execution cleanly. The planner has full task context; the executor has one job per invocation.

Example: Reflexion Loop for Code Generation

Section titled “Example: Reflexion Loop for Code Generation”
import subprocess
def generate_and_validate(task: str, max_retries: int = 3) -> str:
"""Generate code and iteratively refine it until tests pass."""
feedback = ""
for attempt in range(max_retries):
prompt = task if not feedback else f"{task}\n\nPrevious attempt failed:\n{feedback}\nFix the issues."
code = model.invoke([HumanMessage(content=prompt)]).content
# Reflexion: run the code and check for errors
result = subprocess.run(
["python", "-c", code],
capture_output=True, text=True, timeout=10
)
if result.returncode == 0:
return code # Passed — return the working code
# Build structured feedback for the next attempt
feedback = f"Error (attempt {attempt + 1}/{max_retries}):\n{result.stderr}"
return code # Return best attempt after max retries

The key design decision: the feedback passed back to the model is the raw error output, not a paraphrase. The model responds better to exact error messages than to summaries.


7. Trade-offs, Limitations & Failure Modes

Section titled “7. Trade-offs, Limitations & Failure Modes”

Each pattern layer adds latency and cost. A ReAct loop that runs 8 iterations before answering costs 8× the tokens of a single completion. Plan-and-Execute adds a separate planning call. Reflexion adds evaluation and revision calls. Multi-agent multiplies costs by the number of workers.

The trade-off is not complexity vs. simplicity. It is cost vs. reliability. Add pattern layers when the simpler approach is demonstrably failing, not preemptively. Over-engineering an agent system creates latency and cost overhead with no reliability gain.

Plans become stale when execution encounters unexpected intermediate states. An agent tasked with “refactor the authentication module” produces a plan assuming the module uses JWT. Midway through execution, it discovers the module uses session cookies. The plan is now incorrect, but the executor has no mechanism to signal this to the planner.

Solutions: trigger replanning when an executor step fails or produces unexpected output; include an explicit “check plan validity” step in the plan; design plans at a higher level of abstraction that accommodates variation in execution.

ReAct: The Loop-Until-Timeout Anti-Pattern

Section titled “ReAct: The Loop-Until-Timeout Anti-Pattern”

Without a maximum step count, a ReAct agent that cannot find the answer to a query will continue calling tools indefinitely. Always configure a maximum iteration count. When the limit is reached, return the best available answer with an explicit signal that the response may be incomplete — do not return nothing.

Multi-Agent: The Coordination Overhead Problem

Section titled “Multi-Agent: The Coordination Overhead Problem”

Multi-agent systems require explicit interfaces between agents. The orchestrator must produce sub-task descriptions that worker agents can interpret without ambiguity. Workers must return outputs in formats the orchestrator can consume. The more specialized the workers, the higher the coordination overhead.

The failure mode: a worker returns a partial result in an unexpected format. The orchestrator misinterprets it. The synthesis step aggregates misinterpreted results. The final output is incorrect, but the error is invisible in the individual worker outputs.

Mitigation: define typed interfaces for all orchestrator/worker communication; require workers to return structured outputs (JSON schemas, Pydantic models); validate worker outputs before aggregation.

Reflexion is only as good as the evaluator. An LLM evaluating its own output will often give itself high marks — the same biases that produced the original output bias the evaluation. A separate evaluator model with a different training objective is significantly more reliable.

For code: use a syntax checker and test runner (deterministic validators) rather than LLM self-evaluation. For factual accuracy: use retrieval to verify claims rather than asking the model if its claims are correct.


Agentic design patterns are asked with increasing frequency at AI-native companies and at any organization building production LLM systems. For more interview preparation by level, see the GenAI interview questions guide.

“What is the ReAct pattern and why does it exist?” The expected answer covers the interleaving of reasoning and action, why it improves tool selection compared to single-shot tool use, and a concrete example of a Thought/Action/Observation cycle. Mentioning the 2022 paper origin signals depth but is not required. What is required: you can explain what the LLM emits at each step and how the framework processes it.

“How would you design an agent for a complex research task?” This question expects a pattern selection decision, not a framework selection decision. Strong answers: “I would start with Plan-and-Execute because research tasks have many interdependent steps and a single ReAct loop would lose track of the overall goal. Within each executor step I’d use a ReAct sub-loop for tool use. Before returning the synthesis I’d add a Reflexion step to verify that the output actually answers the original question.” Weak answers name a framework without explaining the underlying architecture.

“What happens when an agent gets stuck in a loop?” This tests operational knowledge. The answer should cover: maximum iteration limits as a hard stop, distinguishing between loops caused by tool failure (handle the failure explicitly) vs. loops caused by task ambiguity (return partial answer with explanation), and logging/tracing to detect loops in production.

“How do you evaluate whether your agent is working?” Expected: unit tests for individual tools, integration tests with representative inputs and expected outputs, Reflexion loops for quality-critical outputs, LangSmith or equivalent for production traces, and latency/cost monitoring per agent invocation.


The most common production architecture is a single agent that layers multiple patterns: ReAct as the base loop, Plan-and-Execute for complex tasks, Reflexion before delivery for quality-critical outputs. This is simpler to deploy, monitor, and debug than a multi-agent system. Introduce multi-agent only when the single-agent architecture has a documented bottleneck that multi-agent solves.

Every agent invocation in production should be traced: which tools were called, in what order, what they returned, how many ReAct iterations occurred, whether Reflexion triggered revision. LangSmith, Phoenix (Arize), and Weights & Biases support LLM tracing. Without tracing, debugging production failures is blind.

The specific metrics to track per agent invocation: total latency, tool call count, Reflexion iteration count, final answer token count, and a quality score if you have an evaluation function. Aggregate these metrics over time to detect degradation.

The system prompt and tool descriptions are the primary control surfaces for agent behavior. They are as important as application code and should be treated as such: version-controlled, reviewed in pull requests, and updated based on production failure analysis. A silent change to a tool description can break tool selection across every downstream invocation.

Agents fail. Tools time out, LLM APIs return rate limit errors, retrieved documents are stale. Design each pattern layer with an explicit fallback:

  • ReAct: if the maximum iteration count is reached, return best-available partial answer with a “may be incomplete” flag
  • Plan-and-Execute: if replanning fails, fall back to ReAct on the original task
  • Tool Use: if the primary tool fails, use a fallback tool if available; if not, continue without that tool and flag the gap in the response
  • Multi-Agent: if a worker fails, omit that sub-task from synthesis and flag the gap

A graceful degradation story is part of the production credibility of any agent system.


Seven design patterns cover the majority of production agent architectures:

Tool Use is the foundation. All other patterns depend on well-designed tools with clear descriptions and typed interfaces.

ReAct is the default loop for most agents. It handles multi-step tasks by interleaving explicit reasoning with action. It is sufficient for the majority of production use cases.

Plan-and-Execute adds robustness for long-horizon tasks by separating planning from execution. Use it when ReAct is failing due to myopic step selection, not preemptively.

Reflexion adds quality assurance by evaluating outputs and triggering revision. Prefer deterministic validators (test runners, schema validators) over LLM self-evaluation where possible.

Multi-Agent enables parallelism and specialization. Adds significant coordination complexity. Introduce only when a single agent has a documented bottleneck.

Memory Patterns add persistence across steps and sessions. Choose the memory type based on what the agent needs to remember: short-term conversation buffer for current session, vector store retrieval for long-term knowledge.

Guardrail Patterns protect against misuse and quality failures. Input validation prevents abuse; output validation prevents harm; self-critique catches quality gaps.

Key takeaways:

  • Pattern selection should be driven by observed failure modes, not architectural preference. Add complexity when the simpler approach is demonstrably insufficient.
  • ReAct is sufficient for most production agent tasks. Most agents do not need Plan-and-Execute or multi-agent coordination.
  • Always configure maximum iteration limits. Unbounded reasoning loops are the most common production failure mode in agent systems.
  • Instrument every agent invocation. Without traces, production debugging is impossible.
  • Version your prompts and tool descriptions with the same discipline as application code.
  • Graceful degradation is a requirement, not a nice-to-have.