
Deep Research AI Agent — Build Your Own (2026)

A deep research AI agent breaks complex questions into sub-queries, searches multiple sources in parallel, extracts and ranks findings, and synthesizes everything into a structured report with citations. This guide covers how OpenAI Deep Research and Gemini Deep Research work, and walks through building your own deep research agent with LangGraph.

Deep research agents exist because standard LLM calls produce single-pass answers from training data — they cannot plan queries, cross-reference sources, or iterate when initial results are insufficient.

Standard LLM interactions hit a wall with complex research questions. Ask a chatbot “What are the trade-offs between different vector database indexing strategies for billion-scale datasets?” and you get a single-pass answer based on training data — no source verification, no cross-referencing, no depth.

Deep research agents solve this by mimicking how a skilled human researcher works:

  1. Break the question down into specific, answerable sub-questions
  2. Search multiple sources with varied queries to maximize coverage
  3. Cross-reference findings to identify consensus, contradictions, and gaps
  4. Synthesize a report with citations, confidence levels, and actionable conclusions

The key architectural insight: deep research is not a single LLM call. It is an agentic pattern — a multi-step workflow where the agent plans, executes, evaluates, and iterates.

Who needs deep research agents:

  • Engineers evaluating technical decisions across dozens of sources
  • Product teams conducting competitive analysis with verifiable data
  • Analysts synthesizing market research from scattered reports
  • Anyone who currently spends 2-4 hours manually searching, reading, and summarizing

Traditional RAG retrieves documents matching a query and generates a response. Deep research goes further — it generates multiple queries, searches diverse sources, and iterates when initial results are insufficient. Where RAG answers questions from a known corpus, deep research agents explore unknown territory.


| Development | Impact |
| --- | --- |
| OpenAI Deep Research | o3-powered agent that browses the web for up to 30 minutes, producing cited research reports. Available in ChatGPT Pro and Plus tiers |
| Gemini Deep Research | Google's implementation using Gemini 2.0 Flash with multi-step planning and Google Search integration. Free tier included |
| Perplexity Pro Search | Iterative search with follow-up queries and source citations. Strongest at real-time web research |
| Open-source agents | LangGraph, CrewAI, and AutoGen frameworks make custom deep research agents accessible to any engineering team |
| MCP integration | Model Context Protocol enables agents to connect to arbitrary data sources (internal wikis, databases, APIs), not just web search |
| Structured output | JSON mode and tool-use APIs make it reliable to extract structured data from search results without brittle parsing |

A deep research agent follows a four-stage pipeline — query planning, multi-source search, extract and rank, structured report — and loops back through planning whenever gap detection finds unanswered sub-questions.

Deep Research Agent Pipeline

Each stage feeds into the next. The agent can loop back from Synthesis to Query Planning if gaps are detected.

  1. Query Planning: decompose the question into a search strategy
     • Parse research question
     • Identify sub-questions
     • Generate diverse queries
     • Prioritize by importance
  2. Multi-Source Search: execute queries across data sources
     • Web search APIs
     • Academic databases
     • Internal knowledge bases
     • Structured data endpoints
  3. Extract & Rank: parse results and assess quality
     • Extract key passages
     • Score by relevance
     • Deduplicate findings
     • Flag contradictions
  4. Structured Report: synthesize findings with citations
     • Organize by theme
     • Add source citations
     • Assign confidence levels
     • Identify remaining gaps

The Core Loop: Plan → Search → Extract → Synthesize → (Repeat)


What separates a deep research agent from a simple search-and-summarize chain is the iterative refinement loop. After the first synthesis pass, the agent evaluates its own output:

  • Are there unanswered sub-questions? Generate new queries and search again.
  • Do sources contradict each other? Search for authoritative tiebreakers.
  • Is confidence too low on a critical finding? Seek additional corroboration.

This self-evaluation is the difference between a research assistant and a search engine wrapper.

| Approach | Queries | Sources | Iterations | Time | Best For |
| --- | --- | --- | --- | --- | --- |
| Single RAG call | 1 | Your corpus | 1 | <5 seconds | Known-corpus Q&A |
| Multi-query RAG | 3-5 | Your corpus | 1 | <15 seconds | Complex internal questions |
| Deep research agent | 10-50 | Web + APIs + corpus | 2-5 | 2-30 minutes | Open-ended research |

Each stage of the pipeline builds on the previous one: question decomposition determines search quality, query diversification prevents echo chambers, and gap detection drives iterative refinement.

The agent receives a broad research question and decomposes it into specific, searchable sub-questions. This step determines the quality of everything that follows.

Example input: “What are the best practices for deploying LLMs in regulated healthcare environments?”

Decomposed sub-questions:

  1. What regulatory frameworks apply to AI/LLM use in healthcare? (HIPAA, FDA, EU AI Act)
  2. What are the data residency requirements for healthcare LLM deployments?
  3. Which cloud providers offer HIPAA-compliant LLM hosting?
  4. What audit and explainability requirements exist for clinical AI systems?
  5. What are documented failure modes of LLMs in healthcare settings?

Step 2: Query Generation and Diversification


For each sub-question, the agent generates multiple search queries with different angles:

  • Direct queries: “HIPAA requirements for LLM deployment healthcare”
  • Comparative queries: “AWS vs Azure vs GCP HIPAA compliant LLM hosting”
  • Recency-focused queries: “FDA AI regulation updates 2026”
  • Academic queries: “clinical LLM deployment risk assessment framework”

Query diversification prevents the agent from getting trapped in a single perspective or source cluster.

The agent executes queries across multiple source types simultaneously:

  • Web search (Tavily, Serper, Brave Search API) for current information
  • Academic search (Semantic Scholar, arXiv API) for peer-reviewed research
  • Internal sources (vector databases, wikis, documentation) for proprietary data
  • Structured APIs (government regulation databases, clinical trial registries) for authoritative records

Parallel execution is critical — sequential search across 20-50 queries would take minutes per iteration.
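The fan-out can be sketched with asyncio, bounding concurrency so dozens of parallel queries don't trip API rate limits. This is a minimal sketch: `run_searches` and `search_fn` are illustrative names, not part of any specific library.

```python
import asyncio

async def run_searches(queries: list[str], search_fn, max_concurrent: int = 10) -> list:
    """Execute all searches in parallel, bounded so we don't hammer the API."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(query: str):
        async with sem:
            return await search_fn(query)

    # gather preserves input order, so results line up with the query list
    return await asyncio.gather(*(one(q) for q in queries))
```

The semaphore caps in-flight requests while still letting the batch complete in roughly (queries / max_concurrent) round trips instead of one round trip per query.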

Raw search results contain noise. The extraction step:

  1. Parses each result to extract the relevant passage (not just the snippet)
  2. Scores relevance against the original sub-question (0-1 scale)
  3. Assesses source authority — peer-reviewed paper vs blog post vs forum comment
  4. Deduplicates findings that appear in multiple sources
  5. Flags contradictions where two authoritative sources disagree
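The deduplication step (step 4) can be implemented with a simple similarity pass over the relevance-sorted findings. This sketch matches the `deduplicate_findings(ranked, similarity_threshold=0.85)` call in the LangGraph implementation below; the choice of difflib's `SequenceMatcher` as the similarity metric is illustrative (an embedding-based comparison would also work).

```python
from difflib import SequenceMatcher

def deduplicate_findings(ranked: list[dict], similarity_threshold: float = 0.85) -> list[dict]:
    """Keep the highest-ranked copy of each near-duplicate finding.

    Assumes `ranked` is sorted by relevance, descending, so the first
    occurrence of a duplicate cluster is the one worth keeping.
    """
    kept: list[dict] = []
    for f in ranked:
        text = f.get("finding", "")
        is_duplicate = any(
            SequenceMatcher(None, text, k.get("finding", "")).ratio() >= similarity_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(f)
    return kept
```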

The synthesis stage combines ranked findings into a coherent report. The critical addition is gap detection — the agent identifies sub-questions that remain unanswered or have low-confidence answers, triggering another iteration of the plan-search-extract loop.

A well-built agent caps iterations (typically 2-5) to prevent infinite loops and includes a “remaining gaps” section in the final report.


5. Building a Deep Research Agent with LangGraph


The following implementation uses LangGraph to build a deep research agent with state management, parallel search, and iterative refinement.

from typing import TypedDict
import asyncio

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults

# ── State definition ──────────────────────────────
class ResearchState(TypedDict):
    question: str
    sub_questions: list[str]
    search_queries: list[dict]   # {sub_question, query, source}
    raw_results: list[dict]      # {query, results[], source}
    ranked_findings: list[dict]  # {finding, relevance, source_url, authority}
    report: str
    gaps: list[str]
    iteration: int
    max_iterations: int

# ── LLM setup ─────────────────────────────────────
llm = ChatOpenAI(model="gpt-4o", temperature=0)
search_tool = TavilySearchResults(max_results=5)

# Helpers assumed to be defined elsewhere: parse_json_array, parse_json_object,
# deduplicate_findings, format_findings_for_prompt, parse_report_and_gaps.

def decompose_question(state: ResearchState) -> dict:
    """Break the research question into specific sub-questions."""
    prompt = f"""You are a research planning agent. Decompose this research
question into 3-7 specific, searchable sub-questions.

Research question: {state['question']}

Return a JSON array of sub-questions. Each should be:
- Specific enough to search for directly
- Independent (answerable without the others)
- Covering a distinct facet of the main question"""
    response = llm.invoke(prompt)
    return {"sub_questions": parse_json_array(response.content)}

def generate_queries(state: ResearchState) -> dict:
    """Generate diverse search queries for each sub-question."""
    all_queries = []
    for sq in state["sub_questions"]:
        prompt = f"""Generate 3 diverse search queries for this research
sub-question. Vary the angle: one direct, one comparative, one recent.

Sub-question: {sq}

Return a JSON array of objects with 'query' and 'source' fields.
Source should be 'web', 'academic', or 'news'."""
        response = llm.invoke(prompt)
        queries = parse_json_array(response.content)
        for q in queries:
            q["sub_question"] = sq
        all_queries.extend(queries)
    return {"search_queries": all_queries}

async def execute_searches(state: ResearchState) -> dict:
    """Run all search queries in parallel across sources."""
    async def search_one(query_obj: dict) -> dict:
        results = await search_tool.ainvoke(query_obj["query"])
        return {
            "query": query_obj["query"],
            "sub_question": query_obj["sub_question"],
            "results": results,
            "source": query_obj.get("source", "web"),
        }
    tasks = [search_one(q) for q in state["search_queries"]]
    raw_results = await asyncio.gather(*tasks)
    return {"raw_results": list(raw_results)}

def extract_and_rank(state: ResearchState) -> dict:
    """Extract relevant findings and rank by relevance and authority."""
    all_findings = []
    for result_set in state["raw_results"]:
        for result in result_set.get("results", []):
            prompt = f"""Score this search result for relevance to the
sub-question. Return JSON with 'finding' (key excerpt),
'relevance' (0-1), and 'authority' (low/medium/high).

Sub-question: {result_set['sub_question']}
Result title: {result.get('title', '')}
Result content: {result.get('content', '')[:1500]}"""
            response = llm.invoke(prompt)
            finding = parse_json_object(response.content)
            finding["source_url"] = result.get("url", "")
            finding["sub_question"] = result_set["sub_question"]
            all_findings.append(finding)
    # Sort by relevance, deduplicate by content similarity
    ranked = sorted(all_findings, key=lambda f: f.get("relevance", 0), reverse=True)
    deduplicated = deduplicate_findings(ranked, similarity_threshold=0.85)
    return {"ranked_findings": deduplicated}

def synthesize_report(state: ResearchState) -> dict:
    """Combine findings into a structured report with gap detection."""
    findings_text = format_findings_for_prompt(state["ranked_findings"])
    prompt = f"""You are a research synthesis agent. Create a structured
research report from these findings.

Original question: {state['question']}
Sub-questions: {state['sub_questions']}

Findings (ranked by relevance):
{findings_text}

Your report must include:
1. Executive summary (3-5 sentences)
2. Findings organized by sub-question with inline citations [source_url]
3. Contradictions between sources (if any)
4. Confidence level for each section (high/medium/low)
5. Remaining gaps — sub-questions with insufficient evidence

Also return a JSON array of 'gaps' — sub-questions that need more research."""
    response = llm.invoke(prompt)
    report, gaps = parse_report_and_gaps(response.content)
    return {
        "report": report,
        "gaps": gaps,
        # Promote gaps to the next iteration's sub-questions and count the
        # pass, so the iteration cap in should_continue actually triggers.
        "sub_questions": gaps if gaps else state["sub_questions"],
        "iteration": state.get("iteration", 0) + 1,
    }

def should_continue(state: ResearchState) -> str:
    """Decide whether to iterate or finish."""
    if not state.get("gaps"):
        return "finish"
    if state.get("iteration", 0) >= state.get("max_iterations", 3):
        return "finish"
    return "iterate"

# ── Build the graph ───────────────────────────────
graph = StateGraph(ResearchState)
graph.add_node("decompose", decompose_question)
graph.add_node("generate_queries", generate_queries)
graph.add_node("search", execute_searches)
graph.add_node("extract_rank", extract_and_rank)
graph.add_node("synthesize", synthesize_report)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "generate_queries")
graph.add_edge("generate_queries", "search")
graph.add_edge("search", "extract_rank")
graph.add_edge("extract_rank", "synthesize")
graph.add_conditional_edges("synthesize", should_continue, {
    "iterate": "generate_queries",  # Loop back with gaps as new sub-questions
    "finish": END,
})
research_agent = graph.compile()

# ── Run ───────────────────────────────────────────
# execute_searches is async, so the compiled graph must be invoked asynchronously.
result = asyncio.run(research_agent.ainvoke({
    "question": "What are the best practices for deploying LLMs in healthcare?",
    "max_iterations": 3,
    "sub_questions": [],
    "search_queries": [],
    "raw_results": [],
    "ranked_findings": [],
    "report": "",
    "gaps": [],
    "iteration": 0,
}))
print(result["report"])

For a deeper dive into building multi-node agent graphs, see the LangGraph tutorial and LangChain vs LangGraph comparison.


OpenAI Deep Research, Gemini Deep Research, and Perplexity Pro each implement the same core pipeline with different model tiers, research depth, and pricing — the comparison below covers when to use each versus building custom.

| Feature | OpenAI Deep Research | Gemini Deep Research | Perplexity Pro |
| --- | --- | --- | --- |
| Underlying model | o3 (reasoning) | Gemini 2.0 Flash | Custom multi-model |
| Search sources | Web browsing (full page read) | Google Search + Scholar | Web + academic + news |
| Research time | 5-30 minutes | 1-5 minutes | 30 seconds - 2 minutes |
| Max iterations | Multiple (adaptive) | Up to 5 | 3-5 follow-up queries |
| Output format | Long-form report with citations | Structured document with links | Cited paragraphs |
| Source citations | Inline with URLs | Inline with Google links | Numbered footnotes |
| File analysis | PDF, spreadsheet, code upload | Google Drive integration | Limited |
| API access | Responses API (2026) | Gemini API | Perplexity API |
| Pricing | ChatGPT Pro ($200/mo) or Plus (limited) | Free tier + Gemini Advanced | Perplexity Pro ($20/mo) |
| Best for | Deep technical analysis | Broad topic surveys | Quick factual research |

Use commercial tools when:

  • Research targets are publicly available web content
  • You need results within minutes, not days of development
  • Your team does not have ML engineering capacity
  • The research question is a one-off, not a recurring pipeline

Build custom when:

  • You need to search internal data sources (wikis, databases, Slack, Confluence)
  • Regulatory requirements prohibit sending data to third-party APIs
  • You need deterministic, auditable research pipelines
  • Research is a recurring workflow that justifies the engineering investment
  • You need to integrate with your existing tool ecosystem

Deep research agents fail in five predictable ways — query echo chambers, source authority confusion, infinite refinement loops, hallucinated synthesis, and token cost explosion — each with a concrete mitigation.

Query echo chamber: The agent generates similar queries that all return the same top results. Mitigation: force query diversification with explicit angle requirements (comparative, historical, contrarian).

Source authority confusion: The agent treats a blog post with the same weight as a peer-reviewed paper. Mitigation: implement source-type classification and weight academic/official sources higher in ranking.

Infinite refinement loops: Gap detection always finds something missing, causing the agent to iterate until hitting the cap. Mitigation: set a minimum confidence threshold — if 80%+ of sub-questions are answered at medium-high confidence, stop iterating.
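That stopping rule can be sketched as a small predicate. The field names (`sub_question`, `authority`, `relevance`) follow the finding structure used elsewhere in this guide; the 0.6 relevance cutoff and the function name are illustrative choices.

```python
def should_stop(findings: list[dict], sub_questions: list[str],
                iteration: int, max_iterations: int = 3,
                coverage_target: float = 0.8) -> bool:
    """Stop when enough sub-questions are answered at medium/high confidence,
    or when the hard iteration cap is reached."""
    if iteration >= max_iterations:
        return True
    answered = {
        f["sub_question"] for f in findings
        if f.get("authority") in ("medium", "high") and f.get("relevance", 0) >= 0.6
    }
    return len(answered) / max(len(sub_questions), 1) >= coverage_target
```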

Hallucinated synthesis: The LLM invents connections between findings that do not exist in the source material. Mitigation: require every claim in the synthesis to reference a specific finding ID from the extraction step.
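A minimal citation verifier, assuming findings are assigned IDs like F1, F2 and the synthesis prompt is instructed to cite them inline as [F1]. The marker convention and function name are assumptions for illustration, not a fixed standard.

```python
import re

def verify_citations(report: str, findings: dict[str, dict]) -> list[str]:
    """Return finding IDs cited in the report that don't exist in the findings map.

    A non-empty result means the synthesis step cited a finding that was
    never extracted, i.e. a likely hallucination.
    """
    cited = set(re.findall(r"\[(F\d+)\]", report))
    return sorted(cited - set(findings))
```

Running this check after synthesis and rejecting (or regenerating) any report with unverifiable citations is a cheap guard against invented sources.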

Token cost explosion: 50 queries x 5 results x extraction LLM calls = 250+ LLM invocations per research session. Mitigation: use a cheap model (GPT-4o-mini, Claude Haiku) for extraction/ranking and reserve the expensive model for synthesis only.

| Configuration | Queries | LLM Calls | Est. Cost | Latency |
| --- | --- | --- | --- | --- |
| Minimal (1 iteration, 3 sub-q, 2 queries each) | 6 | ~35 | $0.05-0.15 | 30-60 seconds |
| Standard (2 iterations, 5 sub-q, 3 queries each) | 30 | ~170 | $0.30-0.80 | 2-5 minutes |
| Thorough (3 iterations, 7 sub-q, 3 queries each) | 63+ | ~350+ | $0.80-2.00 | 5-15 minutes |

Costs assume GPT-4o-mini for extraction and GPT-4o for planning/synthesis (March 2026 pricing).
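The call counts in the table come out of simple arithmetic. A sketch, assuming 5 results per query, one extraction call per result, one query-generation call per sub-question, and one planning plus one synthesis call per iteration:

```python
def estimate_llm_calls(sub_questions: int, queries_per_sq: int,
                       results_per_query: int, iterations: int) -> int:
    """Rough LLM call count per research session."""
    per_iteration = (
        1                                                     # planning / re-planning call
        + sub_questions                                       # one query-generation call per sub-question
        + sub_questions * queries_per_sq * results_per_query  # one extraction call per result
        + 1                                                   # synthesis call
    )
    return per_iteration * iterations
```

Plugging in the minimal configuration (3 sub-questions, 2 queries each, 1 iteration) gives 35 calls, matching the table; the extraction term dominates, which is why routing extraction to a cheap model matters most.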


Interview questions on deep research agents test agentic architecture knowledge and system design judgment — the strong answer patterns below show exactly what distinguishes candidates who have built these systems.

Deep research agent questions test your understanding of agentic architectures, multi-step reasoning, and system design trade-offs. Interviewers want to see that you can design a system that handles real-world complexity — not just chain LLM calls together.

Q: “Design a system that can research any topic and produce a cited report.”

Weak: “I would use RAG to search a database and then have the LLM write a report based on the results.”

Strong: “I would build a multi-stage agent with four phases. First, a planning phase decomposes the question into sub-questions and generates diverse search queries — varying angles prevent echo chamber results. Second, a parallel search phase executes queries against web APIs, academic databases, and internal sources. Third, an extraction phase scores each result for relevance and source authority, deduplicates, and flags contradictions. Fourth, a synthesis phase produces a structured report with inline citations and confidence levels. The agent loops back to planning if gap detection identifies unanswered sub-questions, capped at 3 iterations. I would use a cheap model for extraction and an expensive model for synthesis to control costs.”

Q: “How would you evaluate whether a deep research agent produces accurate reports?”

Weak: “I would read the reports and check if they look correct.”

Strong: “Three evaluation dimensions. Factual accuracy: sample 20% of cited claims and verify them against the source URL — are the claims actually supported by the cited source? Coverage: compare the sub-questions identified by the agent against a human-generated question decomposition — did the agent miss important facets? Contradiction handling: inject known contradictory sources and verify the report flags them instead of silently picking one. I would track these metrics over a benchmark set of 50 research questions using the evaluation frameworks we discussed.”

Other follow-up questions to expect:

  • Compare deep research agents vs simple RAG — when do you use each?
  • How would you prevent a research agent from hallucinating citations?
  • Design the state management for a multi-iteration research agent
  • What is the cost/quality trade-off of using smaller models for extraction?
  • How would you add human-in-the-loop review to a research pipeline?
  • Explain how query diversification prevents echo chamber results

Three architecture patterns — async job with webhook, streaming progress updates, and scheduled pipeline — address the different latency and interactivity requirements of production deep research deployments.

Pattern 1: Async Job with Webhook Callback

User submits question → Job queue (Redis/SQS) → Research agent (worker) → Store report → Webhook notification

Best for: research that takes 2-15 minutes. The user submits a question and gets notified when the report is ready. Avoids HTTP timeout issues.

Pattern 2: Streaming Progress Updates

User submits question → WebSocket connection → Agent streams: sub-questions → search progress → findings → report sections

Best for: interactive applications where users want to see the agent working. OpenAI Deep Research uses this pattern — you watch it browse in real time.

Pattern 3: Scheduled Research Pipeline

Cron trigger → Research agent → Diff against previous report → Alert on changes → Store versioned reports

Best for: recurring competitive analysis, regulatory monitoring, or market tracking. The agent runs daily/weekly and flags what changed.
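The diff step can be as simple as a unified diff over the report text using the stdlib's difflib; a production pipeline might diff structured findings instead of raw text. The function name is illustrative.

```python
import difflib

def report_diff(previous: str, current: str) -> list[str]:
    """Unified-diff lines between two report versions, for change alerts.

    An empty list means nothing changed since the last run.
    """
    return list(difflib.unified_diff(
        previous.splitlines(), current.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))
```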

Rate limiting: Search APIs have rate limits (Tavily: 1000/month free, Serper: 2500/month free). Implement request budgets per research session.
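A per-session request budget can be a small counter checked before every search call. The class and method names here are illustrative:

```python
class SearchBudget:
    """Hard cap on search API calls for one research session."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def allow(self) -> bool:
        """Return True and consume one call if budget remains, else False."""
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True
```

The search node checks `budget.allow()` before each API call and skips (or queues) queries once the session budget is exhausted, keeping a single runaway session from consuming the monthly quota.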

Caching: Cache search results for 24 hours. If the same sub-question appears across multiple research sessions, reuse cached results instead of burning API calls.
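A minimal in-process cache keyed on the normalized query, with a 24-hour TTL. A production system would use Redis or similar; the names here are illustrative.

```python
import hashlib
import time

_cache: dict[str, tuple[float, list]] = {}
TTL_SECONDS = 24 * 3600

def cached_search(query: str, search_fn) -> list:
    """Serve a cached result if the same query ran within the last 24 hours."""
    # Normalize so trivially different phrasings share a cache entry
    key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
    now = time.time()
    if key in _cache and now - _cache[key][0] < TTL_SECONDS:
        return _cache[key][1]
    results = search_fn(query)
    _cache[key] = (now, results)
    return results
```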

Observability: Log every LLM call with input/output, every search query with results count, and every ranking decision. Use LangSmith or Langfuse for trace visualization.

Cost guardrails: Set per-session cost caps. A runaway research agent with 5 iterations and 50 queries can easily spend $5-10 per session. Monitor and alert on cost anomalies.

Source freshness: Tag every finding with its publication date. When synthesizing, weight recent sources higher for fast-moving topics (AI, regulations) but not for stable topics (mathematics, physics).
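Recency weighting can be a simple exponential decay with a topic-dependent switch; the one-year half-life and function name are illustrative defaults.

```python
from datetime import date

def freshness_weight(published: date, today: date,
                     half_life_days: float = 365.0,
                     fast_moving: bool = True) -> float:
    """Weight a finding by age: exponential decay for fast-moving topics,
    flat weight for stable ones like mathematics."""
    if not fast_moving:
        return 1.0
    age_days = (today - published).days
    return 0.5 ** (age_days / half_life_days)
```

Multiplying this into the relevance score before synthesis pushes year-old AI regulation findings down the ranking without discarding them outright.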


The table below answers the six most common deep research questions in 30 seconds, followed by official documentation links and related guides.

| Question | Answer |
| --- | --- |
| What is a deep research agent? | An AI system that plans queries, searches multiple sources, and iteratively synthesizes cited reports |
| When should I use one? | When you need comprehensive, multi-source research that goes beyond single-query RAG |
| Build or buy? | Use commercial tools (OpenAI, Gemini) for web research. Build custom for internal data or regulatory requirements |
| What framework? | LangGraph for stateful multi-step agents. See the LangGraph tutorial |
| Key failure mode? | Query echo chambers and hallucinated synthesis. Mitigate with query diversification and citation verification |
| Cost range? | $0.05-2.00 per research session depending on depth and model tier |

Last updated: March 2026. Deep research agent capabilities are evolving rapidly; verify current API access and pricing against official documentation.

Frequently Asked Questions

What is a deep research AI agent?

A deep research AI agent breaks complex questions into sub-queries, searches multiple sources in parallel, extracts and ranks findings, and synthesizes everything into a structured report with citations. Unlike a standard LLM call that gives a single-pass answer from training data, deep research agents mimic skilled human researchers by planning queries, cross-referencing findings, and identifying consensus and contradictions.

How do OpenAI and Gemini deep research features work?

Both OpenAI Deep Research and Gemini Deep Research use agentic patterns internally — they decompose questions into sub-queries, search the web with varied queries, cross-reference findings across sources, and synthesize reports. The key architectural insight is that deep research is not a single LLM call but a multi-step agentic workflow where the agent plans, executes, evaluates, and iterates.

How do you build your own deep research agent?

Build a deep research agent using LangGraph with a state graph that includes nodes for query planning (decompose the question into sub-queries), parallel search execution (search multiple sources simultaneously), finding extraction and ranking, and report synthesis with citations. The agent loops through search and evaluation until coverage is sufficient, then synthesizes the final structured report.

When should I use a deep research agent instead of standard RAG?

Use deep research agents when the question requires searching multiple external sources, cross-referencing findings for contradictions, synthesizing information from scattered reports, or producing structured reports with citations and confidence levels. Standard RAG works for querying a fixed document corpus. Deep research is for open-ended questions that require the kind of multi-source investigation a human researcher would perform.

What tools and APIs support deep research agents?

Deep research agents use web search APIs (Tavily, Serper, Brave Search), academic search APIs (Semantic Scholar, arXiv), and internal data sources via MCP. Frameworks like LangGraph, CrewAI, and AutoGen provide the agent orchestration layer. Commercial options include OpenAI Deep Research, Gemini Deep Research, and Perplexity Pro Search.

How does deep research differ from standard RAG?

Standard RAG retrieves documents from a fixed corpus using a single query and generates a response. Deep research generates multiple queries across diverse sources, iterates when initial results are insufficient, cross-references findings for contradictions, and produces structured reports with citations and confidence levels. Where RAG answers from a known corpus, deep research explores unknown territory across the open web and multiple data sources.

What are the key components of a deep research system?

A deep research system has four core components: a query planner that decomposes questions into sub-queries, a multi-source search executor that runs queries in parallel across web, academic, and internal sources, an extraction and ranking module that scores findings by relevance and source authority, and a synthesis engine that produces structured reports with citations and gap detection.

What are common failure modes of deep research agents?

The five main failure modes are query echo chambers (similar queries returning the same results), source authority confusion (treating blog posts equally with peer-reviewed papers), infinite refinement loops (gap detection always finding something missing), hallucinated synthesis (the LLM inventing connections not in source material), and token cost explosion from hundreds of LLM calls per session. Each can be mitigated with query diversification, source classification, confidence thresholds, citation verification, and tiered model usage.

How much does it cost to run a deep research agent?

Costs range from $0.05 to $2.00 per research session depending on configuration. A minimal run with 6 queries and 1 iteration costs $0.05-0.15. A standard run with 30 queries and 2 iterations costs $0.30-0.80. A thorough run with 63+ queries and 3 iterations costs $0.80-2.00. Using cheaper models for extraction and expensive models only for synthesis helps control costs.

Can you use deep research agents in production applications?

Yes, three production architecture patterns are common: async job with webhook callback for research taking 2-15 minutes, streaming progress updates via WebSocket for interactive applications, and scheduled research pipelines for recurring competitive analysis or market monitoring. Production deployments require rate limiting on search APIs, result caching, cost guardrails, and observability for every LLM call and search query.