
Deep Research AI Agent — Build Your Own (2026)

A deep research AI agent breaks complex questions into sub-queries, searches multiple sources in parallel, extracts and ranks findings, and synthesizes everything into a structured report with citations. This guide covers how OpenAI Deep Research and Gemini Deep Research work, and walks through building your own deep research agent with LangGraph.

Deep research agents exist because standard LLM calls produce single-pass answers from training data — they cannot plan queries, cross-reference sources, or iterate when initial results are insufficient.

Standard LLM interactions hit a wall with complex research questions. Ask a chatbot “What are the trade-offs between different vector database indexing strategies for billion-scale datasets?” and you get a single-pass answer based on training data — no source verification, no cross-referencing, no depth.

Deep research agents solve this by mimicking how a skilled human researcher works:

  1. Break the question down into specific, answerable sub-questions
  2. Search multiple sources with varied queries to maximize coverage
  3. Cross-reference findings to identify consensus, contradictions, and gaps
  4. Synthesize a report with citations, confidence levels, and actionable conclusions

The key architectural insight: deep research is not a single LLM call. It is an agentic pattern — a multi-step workflow where the agent plans, executes, evaluates, and iterates.

Who needs deep research agents:

  • Engineers evaluating technical decisions across dozens of sources
  • Product teams conducting competitive analysis with verifiable data
  • Analysts synthesizing market research from scattered reports
  • Anyone who currently spends 2-4 hours manually searching, reading, and summarizing

Traditional RAG retrieves documents matching a query and generates a response. Deep research goes further — it generates multiple queries, searches diverse sources, and iterates when initial results are insufficient. Where RAG answers questions from a known corpus, deep research agents explore unknown territory.


| Development | Impact |
| --- | --- |
| OpenAI Deep Research | o3-powered agent that browses the web for up to 30 minutes, producing cited research reports. Available in ChatGPT Pro and Plus tiers |
| Gemini Deep Research | Google's implementation using Gemini 2.0 Flash with multi-step planning and Google Search integration. Free tier included |
| Perplexity Pro Search | Iterative search with follow-up queries and source citations. Strongest at real-time web research |
| Open-source agents | LangGraph, CrewAI, and AutoGen frameworks make custom deep research agents accessible to any engineering team |
| MCP integration | Model Context Protocol enables agents to connect to arbitrary data sources (internal wikis, databases, APIs), not just web search |
| Structured output | JSON mode and tool-use APIs make it reliable to extract structured data from search results without brittle parsing |

A deep research agent follows a four-stage pipeline — query planning, multi-source search, extract and rank, structured report — and loops back through planning whenever gap detection finds unanswered sub-questions.

Deep Research Agent Pipeline

Each stage feeds into the next. The agent can loop back from Synthesis to Query Planning if gaps are detected.

  1. Query Planning: decompose the question into a search strategy
     • Parse research question
     • Identify sub-questions
     • Generate diverse queries
     • Prioritize by importance
  2. Multi-Source Search: execute queries across data sources
     • Web search APIs
     • Academic databases
     • Internal knowledge bases
     • Structured data endpoints
  3. Extract & Rank: parse results and assess quality
     • Extract key passages
     • Score by relevance
     • Deduplicate findings
     • Flag contradictions
  4. Structured Report: synthesize findings with citations
     • Organize by theme
     • Add source citations
     • Assign confidence levels
     • Identify remaining gaps

The Core Loop: Plan → Search → Extract → Synthesize → (Repeat)


What separates a deep research agent from a simple search-and-summarize chain is the iterative refinement loop. After the first synthesis pass, the agent evaluates its own output:

  • Are there unanswered sub-questions? Generate new queries and search again.
  • Do sources contradict each other? Search for authoritative tiebreakers.
  • Is confidence too low on a critical finding? Seek additional corroboration.

This self-evaluation is the difference between a research assistant and a search engine wrapper.

| Approach | Queries | Sources | Iterations | Time | Best For |
| --- | --- | --- | --- | --- | --- |
| Single RAG call | 1 | Your corpus | 1 | <5 seconds | Known-corpus Q&A |
| Multi-query RAG | 3-5 | Your corpus | 1 | <15 seconds | Complex internal questions |
| Deep research agent | 10-50 | Web + APIs + corpus | 2-5 | 2-30 minutes | Open-ended research |

Each stage of the pipeline builds on the previous one: question decomposition determines search quality, query diversification prevents echo chambers, and gap detection drives iterative refinement.

The agent receives a broad research question and decomposes it into specific, searchable sub-questions. This step determines the quality of everything that follows.

Example input: “What are the best practices for deploying LLMs in regulated healthcare environments?”

Decomposed sub-questions:

  1. What regulatory frameworks apply to AI/LLM use in healthcare? (HIPAA, FDA, EU AI Act)
  2. What are the data residency requirements for healthcare LLM deployments?
  3. Which cloud providers offer HIPAA-compliant LLM hosting?
  4. What audit and explainability requirements exist for clinical AI systems?
  5. What are documented failure modes of LLMs in healthcare settings?

Step 2: Query Generation and Diversification


For each sub-question, the agent generates multiple search queries with different angles:

  • Direct queries: “HIPAA requirements for LLM deployment healthcare”
  • Comparative queries: “AWS vs Azure vs GCP HIPAA compliant LLM hosting”
  • Recency-focused queries: “FDA AI regulation updates 2026”
  • Academic queries: “clinical LLM deployment risk assessment framework”

Query diversification prevents the agent from getting trapped in a single perspective or source cluster.

The agent executes queries across multiple source types simultaneously:

  • Web search (Tavily, Serper, Brave Search API) for current information
  • Academic search (Semantic Scholar, arXiv API) for peer-reviewed research
  • Internal sources (vector databases, wikis, documentation) for proprietary data
  • Structured APIs (government regulation databases, clinical trial registries) for authoritative records

Parallel execution is critical — sequential search across 20-50 queries would take minutes per iteration.
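The fan-out can be sketched with asyncio, bounding concurrency so dozens of parallel queries don't trip API rate limits. This is a minimal sketch: `run_searches` and `search_fn` are illustrative names, not part of any specific library.

```python
import asyncio

async def run_searches(queries: list[str], search_fn, max_concurrent: int = 10) -> list:
    """Execute all searches in parallel, bounded so we don't hammer the API."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(query: str):
        async with sem:
            return await search_fn(query)

    # gather preserves input order, so results line up with the query list
    return await asyncio.gather(*(one(q) for q in queries))
```

The semaphore caps in-flight requests while still letting the batch complete in roughly (queries / max_concurrent) round trips instead of one round trip per query.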

Raw search results contain noise. The extraction step:

  1. Parses each result to extract the relevant passage (not just the snippet)
  2. Scores relevance against the original sub-question (0-1 scale)
  3. Assesses source authority — peer-reviewed paper vs blog post vs forum comment
  4. Deduplicates findings that appear in multiple sources
  5. Flags contradictions where two authoritative sources disagree
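The deduplication step (step 4) can be implemented with a simple similarity pass over the relevance-sorted findings. This sketch matches the `deduplicate_findings(ranked, similarity_threshold=0.85)` call in the LangGraph implementation below; the choice of difflib's `SequenceMatcher` as the similarity metric is illustrative (an embedding-based comparison would also work).

```python
from difflib import SequenceMatcher

def deduplicate_findings(ranked: list[dict], similarity_threshold: float = 0.85) -> list[dict]:
    """Keep the highest-ranked copy of each near-duplicate finding.

    Assumes `ranked` is sorted by relevance, descending, so the first
    occurrence of a duplicate cluster is the one worth keeping.
    """
    kept: list[dict] = []
    for f in ranked:
        text = f.get("finding", "")
        is_duplicate = any(
            SequenceMatcher(None, text, k.get("finding", "")).ratio() >= similarity_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(f)
    return kept
```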

The synthesis stage combines ranked findings into a coherent report. The critical addition is gap detection — the agent identifies sub-questions that remain unanswered or have low-confidence answers, triggering another iteration of the plan-search-extract loop.

A well-built agent caps iterations (typically 2-5) to prevent infinite loops and includes a “remaining gaps” section in the final report.


5. Building a Deep Research Agent with LangGraph


The following implementation uses LangGraph to build a deep research agent with state management, parallel search, and iterative refinement.

from typing import TypedDict
import asyncio

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults

# ── State definition ──────────────────────────────
class ResearchState(TypedDict):
    question: str
    sub_questions: list[str]
    search_queries: list[dict]   # {sub_question, query, source}
    raw_results: list[dict]      # {query, results[], source}
    ranked_findings: list[dict]  # {finding, relevance, source_url, authority}
    report: str
    gaps: list[str]
    iteration: int
    max_iterations: int

# ── LLM setup ─────────────────────────────────────
llm = ChatOpenAI(model="gpt-4o", temperature=0)
search_tool = TavilySearchResults(max_results=5)

# Helpers assumed to be defined elsewhere: parse_json_array, parse_json_object,
# deduplicate_findings, format_findings_for_prompt, parse_report_and_gaps.

def decompose_question(state: ResearchState) -> dict:
    """Break the research question into specific sub-questions."""
    prompt = f"""You are a research planning agent. Decompose this research
question into 3-7 specific, searchable sub-questions.

Research question: {state['question']}

Return a JSON array of sub-questions. Each should be:
- Specific enough to search for directly
- Independent (answerable without the others)
- Covering a distinct facet of the main question"""
    response = llm.invoke(prompt)
    return {"sub_questions": parse_json_array(response.content)}

def generate_queries(state: ResearchState) -> dict:
    """Generate diverse search queries for each sub-question."""
    all_queries = []
    for sq in state["sub_questions"]:
        prompt = f"""Generate 3 diverse search queries for this research
sub-question. Vary the angle: one direct, one comparative, one recent.

Sub-question: {sq}

Return a JSON array of objects with 'query' and 'source' fields.
Source should be 'web', 'academic', or 'news'."""
        response = llm.invoke(prompt)
        queries = parse_json_array(response.content)
        for q in queries:
            q["sub_question"] = sq
        all_queries.extend(queries)
    return {"search_queries": all_queries}

async def execute_searches(state: ResearchState) -> dict:
    """Run all search queries in parallel across sources."""
    async def search_one(query_obj: dict) -> dict:
        results = await search_tool.ainvoke(query_obj["query"])
        return {
            "query": query_obj["query"],
            "sub_question": query_obj["sub_question"],
            "results": results,
            "source": query_obj.get("source", "web"),
        }
    tasks = [search_one(q) for q in state["search_queries"]]
    raw_results = await asyncio.gather(*tasks)
    return {"raw_results": list(raw_results)}

def extract_and_rank(state: ResearchState) -> dict:
    """Extract relevant findings and rank by relevance and authority."""
    all_findings = []
    for result_set in state["raw_results"]:
        for result in result_set.get("results", []):
            prompt = f"""Score this search result for relevance to the
sub-question. Return JSON with 'finding' (key excerpt),
'relevance' (0-1), and 'authority' (low/medium/high).

Sub-question: {result_set['sub_question']}
Result title: {result.get('title', '')}
Result content: {result.get('content', '')[:1500]}"""
            response = llm.invoke(prompt)
            finding = parse_json_object(response.content)
            finding["source_url"] = result.get("url", "")
            finding["sub_question"] = result_set["sub_question"]
            all_findings.append(finding)
    # Sort by relevance, deduplicate by content similarity
    ranked = sorted(all_findings, key=lambda f: f.get("relevance", 0), reverse=True)
    deduplicated = deduplicate_findings(ranked, similarity_threshold=0.85)
    return {"ranked_findings": deduplicated}

def synthesize_report(state: ResearchState) -> dict:
    """Combine findings into a structured report with gap detection."""
    findings_text = format_findings_for_prompt(state["ranked_findings"])
    prompt = f"""You are a research synthesis agent. Create a structured
research report from these findings.

Original question: {state['question']}
Sub-questions: {state['sub_questions']}

Findings (ranked by relevance):
{findings_text}

Your report must include:
1. Executive summary (3-5 sentences)
2. Findings organized by sub-question with inline citations [source_url]
3. Contradictions between sources (if any)
4. Confidence level for each section (high/medium/low)
5. Remaining gaps — sub-questions with insufficient evidence

Also return a JSON array of 'gaps' — sub-questions that need more research."""
    response = llm.invoke(prompt)
    report, gaps = parse_report_and_gaps(response.content)
    return {
        "report": report,
        "gaps": gaps,
        # Promote gaps to the next iteration's sub-questions and count the
        # pass, so the iteration cap in should_continue actually triggers.
        "sub_questions": gaps if gaps else state["sub_questions"],
        "iteration": state.get("iteration", 0) + 1,
    }

def should_continue(state: ResearchState) -> str:
    """Decide whether to iterate or finish."""
    if not state.get("gaps"):
        return "finish"
    if state.get("iteration", 0) >= state.get("max_iterations", 3):
        return "finish"
    return "iterate"

# ── Build the graph ───────────────────────────────
graph = StateGraph(ResearchState)
graph.add_node("decompose", decompose_question)
graph.add_node("generate_queries", generate_queries)
graph.add_node("search", execute_searches)
graph.add_node("extract_rank", extract_and_rank)
graph.add_node("synthesize", synthesize_report)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "generate_queries")
graph.add_edge("generate_queries", "search")
graph.add_edge("search", "extract_rank")
graph.add_edge("extract_rank", "synthesize")
graph.add_conditional_edges("synthesize", should_continue, {
    "iterate": "generate_queries",  # Loop back with gaps as new sub-questions
    "finish": END,
})
research_agent = graph.compile()

# ── Run ───────────────────────────────────────────
# execute_searches is async, so the compiled graph must be invoked asynchronously.
result = asyncio.run(research_agent.ainvoke({
    "question": "What are the best practices for deploying LLMs in healthcare?",
    "max_iterations": 3,
    "sub_questions": [],
    "search_queries": [],
    "raw_results": [],
    "ranked_findings": [],
    "report": "",
    "gaps": [],
    "iteration": 0,
}))
print(result["report"])

For a deeper dive into building multi-node agent graphs, see the LangGraph tutorial and LangChain vs LangGraph comparison.


OpenAI Deep Research, Gemini Deep Research, and Perplexity Pro each implement the same core pipeline with different model tiers, research depth, and pricing — the comparison below covers when to use each versus building custom.

| Feature | OpenAI Deep Research | Gemini Deep Research | Perplexity Pro |
| --- | --- | --- | --- |
| Underlying model | o3 (reasoning) | Gemini 2.0 Flash | Custom multi-model |
| Search sources | Web browsing (full page read) | Google Search + Scholar | Web + academic + news |
| Research time | 5-30 minutes | 1-5 minutes | 30 seconds - 2 minutes |
| Max iterations | Multiple (adaptive) | Up to 5 | 3-5 follow-up queries |
| Output format | Long-form report with citations | Structured document with links | Cited paragraphs |
| Source citations | Inline with URLs | Inline with Google links | Numbered footnotes |
| File analysis | PDF, spreadsheet, code upload | Google Drive integration | Limited |
| API access | Responses API (2026) | Gemini API | Perplexity API |
| Pricing | ChatGPT Pro ($200/mo) or Plus (limited) | Free tier + Gemini Advanced | Perplexity Pro ($20/mo) |
| Best for | Deep technical analysis | Broad topic surveys | Quick factual research |

Use commercial tools when:

  • Research targets are publicly available web content
  • You need results within minutes, not days of development
  • Your team does not have ML engineering capacity
  • The research question is a one-off, not a recurring pipeline

Build custom when:

  • You need to search internal data sources (wikis, databases, Slack, Confluence)
  • Regulatory requirements prohibit sending data to third-party APIs
  • You need deterministic, auditable research pipelines
  • Research is a recurring workflow that justifies the engineering investment
  • You need to integrate with your existing tool ecosystem

Deep research agents fail in five predictable ways — query echo chambers, source authority confusion, infinite refinement loops, hallucinated synthesis, and token cost explosion — each with a concrete mitigation.

Query echo chamber: The agent generates similar queries that all return the same top results. Mitigation: force query diversification with explicit angle requirements (comparative, historical, contrarian).

Source authority confusion: The agent treats a blog post with the same weight as a peer-reviewed paper. Mitigation: implement source-type classification and weight academic/official sources higher in ranking.

Infinite refinement loops: Gap detection always finds something missing, causing the agent to iterate until hitting the cap. Mitigation: set a minimum confidence threshold — if 80%+ of sub-questions are answered at medium-high confidence, stop iterating.
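That stopping rule can be sketched as a small predicate. The field names (`sub_question`, `authority`, `relevance`) follow the finding structure used elsewhere in this guide; the 0.6 relevance cutoff and the function name are illustrative choices.

```python
def should_stop(findings: list[dict], sub_questions: list[str],
                iteration: int, max_iterations: int = 3,
                coverage_target: float = 0.8) -> bool:
    """Stop when enough sub-questions are answered at medium/high confidence,
    or when the hard iteration cap is reached."""
    if iteration >= max_iterations:
        return True
    answered = {
        f["sub_question"] for f in findings
        if f.get("authority") in ("medium", "high") and f.get("relevance", 0) >= 0.6
    }
    return len(answered) / max(len(sub_questions), 1) >= coverage_target
```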

Hallucinated synthesis: The LLM invents connections between findings that do not exist in the source material. Mitigation: require every claim in the synthesis to reference a specific finding ID from the extraction step.
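A minimal citation verifier, assuming findings are assigned IDs like F1, F2 and the synthesis prompt is instructed to cite them inline as [F1]. The marker convention and function name are assumptions for illustration, not a fixed standard.

```python
import re

def verify_citations(report: str, findings: dict[str, dict]) -> list[str]:
    """Return finding IDs cited in the report that don't exist in the findings map.

    A non-empty result means the synthesis step cited a finding that was
    never extracted, i.e. a likely hallucination.
    """
    cited = set(re.findall(r"\[(F\d+)\]", report))
    return sorted(cited - set(findings))
```

Running this check after synthesis and rejecting (or regenerating) any report with unverifiable citations is a cheap guard against invented sources.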

Token cost explosion: 50 queries x 5 results x extraction LLM calls = 250+ LLM invocations per research session. Mitigation: use a cheap model (GPT-4o-mini, Claude Haiku) for extraction/ranking and reserve the expensive model for synthesis only.

| Configuration | Queries | LLM Calls | Est. Cost | Latency |
| --- | --- | --- | --- | --- |
| Minimal (1 iteration, 3 sub-q, 2 queries each) | 6 | ~35 | $0.05-0.15 | 30-60 seconds |
| Standard (2 iterations, 5 sub-q, 3 queries each) | 30 | ~170 | $0.30-0.80 | 2-5 minutes |
| Thorough (3 iterations, 7 sub-q, 3 queries each) | 63+ | ~350+ | $0.80-2.00 | 5-15 minutes |

Costs assume GPT-4o-mini for extraction and GPT-4o for planning/synthesis (March 2026 pricing).
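The call counts in the table come out of simple arithmetic. A sketch, assuming 5 results per query, one extraction call per result, one query-generation call per sub-question, and one planning plus one synthesis call per iteration:

```python
def estimate_llm_calls(sub_questions: int, queries_per_sq: int,
                       results_per_query: int, iterations: int) -> int:
    """Rough LLM call count per research session."""
    per_iteration = (
        1                                                     # planning / re-planning call
        + sub_questions                                       # one query-generation call per sub-question
        + sub_questions * queries_per_sq * results_per_query  # one extraction call per result
        + 1                                                   # synthesis call
    )
    return per_iteration * iterations
```

Plugging in the minimal configuration (3 sub-questions, 2 queries each, 1 iteration) gives 35 calls, matching the table; the extraction term dominates, which is why routing extraction to a cheap model matters most.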


Interview questions on deep research agents test agentic architecture knowledge and system design judgment — the strong answer patterns below show exactly what distinguishes candidates who have built these systems.

Deep research agent questions test your understanding of agentic architectures, multi-step reasoning, and system design trade-offs. Interviewers want to see that you can design a system that handles real-world complexity — not just chain LLM calls together.

Q: “Design a system that can research any topic and produce a cited report.”

Weak: “I would use RAG to search a database and then have the LLM write a report based on the results.”

Strong: “I would build a multi-stage agent with four phases. First, a planning phase decomposes the question into sub-questions and generates diverse search queries — varying angles prevent echo chamber results. Second, a parallel search phase executes queries against web APIs, academic databases, and internal sources. Third, an extraction phase scores each result for relevance and source authority, deduplicates, and flags contradictions. Fourth, a synthesis phase produces a structured report with inline citations and confidence levels. The agent loops back to planning if gap detection identifies unanswered sub-questions, capped at 3 iterations. I would use a cheap model for extraction and an expensive model for synthesis to control costs.”

Q: “How would you evaluate whether a deep research agent produces accurate reports?”

Weak: “I would read the reports and check if they look correct.”

Strong: “Three evaluation dimensions. Factual accuracy: sample 20% of cited claims and verify them against the source URL — are the claims actually supported by the cited source? Coverage: compare the sub-questions identified by the agent against a human-generated question decomposition — did the agent miss important facets? Contradiction handling: inject known contradictory sources and verify the report flags them instead of silently picking one. I would track these metrics over a benchmark set of 50 research questions using the evaluation frameworks we discussed.”

Other follow-up questions to expect:

  • Compare deep research agents vs simple RAG — when do you use each?
  • How would you prevent a research agent from hallucinating citations?
  • Design the state management for a multi-iteration research agent
  • What is the cost/quality trade-off of using smaller models for extraction?
  • How would you add human-in-the-loop review to a research pipeline?
  • Explain how query diversification prevents echo chamber results

Three architecture patterns — async job with webhook, streaming progress updates, and scheduled pipeline — address the different latency and interactivity requirements of production deep research deployments.

Pattern 1: Async Job with Webhook Callback

User submits question → Job queue (Redis/SQS) → Research agent (worker) → Store report → Webhook notification

Best for: research that takes 2-15 minutes. The user submits a question and gets notified when the report is ready. Avoids HTTP timeout issues.

Pattern 2: Streaming Progress Updates

User submits question → WebSocket connection → Agent streams: sub-questions → search progress → findings → report sections

Best for: interactive applications where users want to see the agent working. OpenAI Deep Research uses this pattern — you watch it browse in real time.

Pattern 3: Scheduled Research Pipeline

Cron trigger → Research agent → Diff against previous report → Alert on changes → Store versioned reports

Best for: recurring competitive analysis, regulatory monitoring, or market tracking. The agent runs daily/weekly and flags what changed.
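The diff step can be as simple as a unified diff over the report text using the stdlib's difflib; a production pipeline might diff structured findings instead of raw text. The function name is illustrative.

```python
import difflib

def report_diff(previous: str, current: str) -> list[str]:
    """Unified-diff lines between two report versions, for change alerts.

    An empty list means nothing changed since the last run.
    """
    return list(difflib.unified_diff(
        previous.splitlines(), current.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))
```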

Rate limiting: Search APIs have rate limits (Tavily: 1000/month free, Serper: 2500/month free). Implement request budgets per research session.
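A per-session request budget can be a small counter checked before every search call. The class and method names here are illustrative:

```python
class SearchBudget:
    """Hard cap on search API calls for one research session."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def allow(self) -> bool:
        """Return True and consume one call if budget remains, else False."""
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True
```

The search node checks `budget.allow()` before each API call and skips (or queues) queries once the session budget is exhausted, keeping a single runaway session from consuming the monthly quota.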

Caching: Cache search results for 24 hours. If the same sub-question appears across multiple research sessions, reuse cached results instead of burning API calls.
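A minimal in-process cache keyed on the normalized query, with a 24-hour TTL. A production system would use Redis or similar; the names here are illustrative.

```python
import hashlib
import time

_cache: dict[str, tuple[float, list]] = {}
TTL_SECONDS = 24 * 3600

def cached_search(query: str, search_fn) -> list:
    """Serve a cached result if the same query ran within the last 24 hours."""
    # Normalize so trivially different phrasings share a cache entry
    key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
    now = time.time()
    if key in _cache and now - _cache[key][0] < TTL_SECONDS:
        return _cache[key][1]
    results = search_fn(query)
    _cache[key] = (now, results)
    return results
```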

Observability: Log every LLM call with input/output, every search query with results count, and every ranking decision. Use LangSmith or Langfuse for trace visualization.

Cost guardrails: Set per-session cost caps. A runaway research agent with 5 iterations and 50 queries can easily spend $5-10 per session. Monitor and alert on cost anomalies.

Source freshness: Tag every finding with its publication date. When synthesizing, weight recent sources higher for fast-moving topics (AI, regulations) but not for stable topics (mathematics, physics).
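Recency weighting can be a simple exponential decay with a topic-dependent switch; the one-year half-life and function name are illustrative defaults.

```python
from datetime import date

def freshness_weight(published: date, today: date,
                     half_life_days: float = 365.0,
                     fast_moving: bool = True) -> float:
    """Weight a finding by age: exponential decay for fast-moving topics,
    flat weight for stable ones like mathematics."""
    if not fast_moving:
        return 1.0
    age_days = (today - published).days
    return 0.5 ** (age_days / half_life_days)
```

Multiplying this into the relevance score before synthesis pushes year-old AI regulation findings down the ranking without discarding them outright.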


The table below answers the six most common deep research questions in 30 seconds, followed by official documentation links and related guides.

| Question | Answer |
| --- | --- |
| What is a deep research agent? | An AI system that plans queries, searches multiple sources, and iteratively synthesizes cited reports |
| When should I use one? | When you need comprehensive, multi-source research that goes beyond single-query RAG |
| Build or buy? | Use commercial tools (OpenAI, Gemini) for web research. Build custom for internal data or regulatory requirements |
| What framework? | LangGraph for stateful multi-step agents. See the LangGraph tutorial |
| Key failure mode? | Query echo chambers and hallucinated synthesis. Mitigate with query diversification and citation verification |
| Cost range? | $0.05-2.00 per research session depending on depth and model tier |

Last updated: March 2026. Deep research agent capabilities are evolving rapidly; verify current API access and pricing against official documentation.

Frequently Asked Questions

What is a deep research AI agent?

A deep research AI agent breaks complex questions into sub-queries, searches multiple sources in parallel, extracts and ranks findings, and synthesizes everything into a structured report with citations. Unlike a standard LLM call that gives a single-pass answer from training data, deep research agents mimic skilled human researchers by planning queries, cross-referencing findings, and identifying consensus and contradictions.

How do OpenAI and Gemini deep research features work?

Both OpenAI Deep Research and Gemini Deep Research use agentic patterns internally — they decompose questions into sub-queries, search the web with varied queries, cross-reference findings across sources, and synthesize reports. The key architectural insight is that deep research is not a single LLM call but a multi-step agentic workflow where the agent plans, executes, evaluates, and iterates.

How do you build your own deep research agent?

Build a deep research agent using LangGraph with a state graph that includes nodes for query planning (decompose the question into sub-queries), parallel search execution (search multiple sources simultaneously), finding extraction and ranking, and report synthesis with citations. The agent loops through search and evaluation until coverage is sufficient, then synthesizes the final structured report.

When should I use a deep research agent instead of standard RAG?

Use deep research agents when the question requires searching multiple external sources, cross-referencing findings for contradictions, synthesizing information from scattered reports, or producing structured reports with citations and confidence levels. Standard RAG works for querying a fixed document corpus. Deep research is for open-ended questions that require the kind of multi-source investigation a human researcher would perform.

What tools and APIs support deep research agents?

Deep research agents use web search APIs (Tavily, Serper, Brave Search), academic search APIs (Semantic Scholar, arXiv), and internal data sources via MCP. Frameworks like LangGraph, CrewAI, and AutoGen provide the agent orchestration layer. Commercial options include OpenAI Deep Research, Gemini Deep Research, and Perplexity Pro Search.

How does deep research differ from standard RAG?

Standard RAG retrieves documents from a fixed corpus using a single query and generates a response. Deep research generates multiple queries across diverse sources, iterates when initial results are insufficient, cross-references findings for contradictions, and produces structured reports with citations and confidence levels. Where RAG answers from a known corpus, deep research explores unknown territory across the open web and multiple data sources.

What are the key components of a deep research system?

A deep research system has four core components: a query planner that decomposes questions into sub-queries, a multi-source search executor that runs queries in parallel across web, academic, and internal sources, an extraction and ranking module that scores findings by relevance and source authority, and a synthesis engine that produces structured reports with citations and gap detection.

What are common failure modes of deep research agents?

The five main failure modes are query echo chambers (similar queries returning the same results), source authority confusion (treating blog posts equally with peer-reviewed papers), infinite refinement loops (gap detection always finding something missing), hallucinated synthesis (the LLM inventing connections not in source material), and token cost explosion from hundreds of LLM calls per session. Each can be mitigated with query diversification, source classification, confidence thresholds, citation verification, and tiered model usage.

How much does it cost to run a deep research agent?

Costs range from $0.05 to $2.00 per research session depending on configuration. A minimal run with 6 queries and 1 iteration costs $0.05-0.15. A standard run with 30 queries and 2 iterations costs $0.30-0.80. A thorough run with 63+ queries and 3 iterations costs $0.80-2.00. Using cheaper models for extraction and expensive models only for synthesis helps control costs.

Can you use deep research agents in production applications?

Yes, three production architecture patterns are common: async job with webhook callback for research taking 2-15 minutes, streaming progress updates via WebSocket for interactive applications, and scheduled research pipelines for recurring competitive analysis or market monitoring. Production deployments require rate limiting on search APIs, result caching, cost guardrails, and observability for every LLM call and search query.