
RAG Architecture Guide — Retrieval-Augmented Generation for Production

Large language models are trained on a static snapshot of the internet with a hard knowledge cutoff. Once training ends, they cannot update what they know. They have no access to your private documents, internal wikis, customer data, real-time information, or anything that did not exist in their training corpus.

This creates a fundamental problem for production applications. A customer service bot needs to know your current product catalog. A legal tool needs to reference the specific contract at hand. An enterprise search system needs to surface internal documentation that was written last week. The LLM’s parametric knowledge — the knowledge baked into its weights — cannot satisfy these requirements.

The obvious solution is to supply the missing information directly. Include the relevant documents in the prompt. Let the LLM reason over that context rather than trying to recall something it was never trained on. This is the core idea behind Retrieval-Augmented Generation: at query time, retrieve the relevant documents from a knowledge base and inject them into the prompt before the LLM responds.

The retrieval step is what makes this practical at scale. You cannot fit an entire enterprise document corpus into a single prompt — context windows are finite and LLM costs scale with token count. Retrieval narrows the corpus to only the documents most relevant to the current query, keeping prompts manageable and costs bounded.

RAG is the dominant architecture for knowledge-intensive LLM applications because it solves the knowledge currency problem without requiring expensive model retraining, and it provides a form of grounding — the LLM’s answer is traceable back to specific source documents.

This guide covers:

  • The two-pipeline architecture of every RAG system: indexing and retrieval
  • Chunking strategies and why they matter more than most engineers expect
  • How embedding models and vector databases enable semantic search
  • Hybrid search, reranking, and the retrieval quality improvement chain
  • Advanced patterns including HyDE, multi-query retrieval, and contextual retrieval
  • How to evaluate a RAG system using RAGAS metrics
  • Where RAG fails in production and how to design against those failures
  • What interviewers expect when discussing RAG architecture

Consider a company with 50,000 internal documents: engineering RFCs, HR policies, product specs, customer contracts, meeting notes, support tickets. A new engineer asks: “What is our company’s policy on rotating database credentials?” Searching manually through a 50,000-document corpus is infeasible. A keyword search returns 300 documents, most irrelevant. A traditional search engine returns results ranked by keyword overlap, not by semantic relevance to the question.

This is exactly the problem RAG solves. The offline indexing pipeline processes all 50,000 documents, chunks them, embeds them, and stores them in a vector database. At query time, the engineer’s question is embedded and the vector database returns the top 5 most semantically similar chunks — likely the security policy section that describes credential rotation requirements. Those chunks are injected into the prompt, and the LLM generates a specific, grounded answer.

This is not a contrived example. Enterprise search was one of the first RAG applications deployed at scale, starting in 2023, and remains one of the most common production use cases.

The demo version of RAG is straightforward: load a PDF, chunk it into fixed-size paragraphs, embed them, store in a local vector DB, run a similarity search on the user’s question, stuff the results into a prompt. This works well for demos on clean, single-topic documents.

Production RAG fails in ways the demo never encounters:

  • Heterogeneous documents: A mix of PDFs, HTML pages, Excel files, email threads, database records, and code files — each requiring different preprocessing
  • Document quality variance: Scanned PDFs with poor OCR, documents with heavy table/image content, deeply nested HTML that produces garbage text after extraction
  • Poor retrieval recall: The right document exists but is not retrieved because the query phrasing is different from the document’s language
  • Poor context precision: Retrieval returns 5 chunks, but 4 are tangentially related noise that confuses the LLM and degrades answer quality
  • Multi-hop questions: “What changed between the 2023 and 2024 versions of our security policy?” requires retrieving from two specific documents and comparing them
  • Latency at scale: A vector search that completes in 50ms for 10,000 documents may take 300ms for 10 million

Real RAG engineering is mostly about fixing these production failures systematically.


Every RAG system has two distinct execution paths that must be understood and optimized independently:

The indexing pipeline runs offline, asynchronously, when your knowledge base changes. It transforms raw documents into a form that enables fast, relevant retrieval at query time. It is batch-oriented, latency-tolerant, and can be expensive computationally.

The retrieval pipeline runs online, synchronously, on every user query. It must be fast — typically under 200ms for the retrieval step alone. It is the user-facing critical path.

These two pipelines share a critical interface: the vector database. The indexing pipeline writes vectors into it; the retrieval pipeline reads from it. Every decision about chunking, embedding, and indexing made offline directly determines retrieval quality online.

RAG System Architecture — Two Pipelines

The offline indexing pipeline prepares your knowledge base. The online retrieval pipeline serves every query. Both connect through the vector store.

  • Offline (index; runs asynchronously when data changes): load documents → chunk → embed → store in vector DB
  • Online (retrieve; runs on every user query): user query → embed query → vector search → return top-k chunks
  • Online (generate; augment and respond): assemble prompt → inject context → LLM call → grounded response

Chunking is the process of splitting documents into retrievable units. The vector database indexes chunks, not whole documents. Retrieval returns chunks, not whole documents. The granularity of your chunks determines both what can be retrieved and what context the LLM receives.

This decision has more impact on RAG quality than most engineers expect. Common chunking strategies:

Fixed-size chunking: Split every N tokens with an overlap of M tokens. Simple, fast, and predictable. The overlap prevents context loss at chunk boundaries. The main weakness: splits can bisect sentences, paragraphs, or conceptual units arbitrarily, creating chunks that make no sense in isolation.

Semantic chunking: Split on natural document boundaries — paragraphs, sections, sentences — rather than token count. Produces chunks that are semantically coherent. More expensive to compute. The main weakness: chunk sizes vary widely, making it harder to reason about context window usage.

Hierarchical chunking (parent-child): Index small chunks for precise retrieval but return larger parent chunks to the LLM for context. The small chunk gets matched by similarity; the large chunk provides the surrounding context needed for a coherent answer. This is the most effective approach for most production systems.
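As a concrete reference point, here is a minimal sketch of fixed-size chunking with overlap. It approximates tokens with whitespace-split words purely to stay self-contained; a production indexer would count tokens with the embedding model's tokenizer instead.

```python
def chunk_fixed_size(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into ~chunk_size-token chunks with `overlap` tokens of overlap."""
    words = text.split()  # crude stand-in for real tokenization
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # stop once the window reaches the end of the document
    return chunks
```

An overlap of roughly 10–20% of the chunk size is a common starting point.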

An embedding model converts a chunk of text into a dense vector — a list of hundreds or thousands of floating-point numbers that encodes the text’s semantic meaning. Chunks with similar meaning have vectors that are close in the high-dimensional space; chunks with different meanings have vectors that are far apart.

The quality of your embedding model is the quality ceiling for your retrieval. A retrieval system that uses a weak embedding model cannot be fixed by improving the vector search algorithm — the signal is lost before search begins.
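The retrieval math underneath this is plain vector similarity. A minimal sketch follows, with embed() stubbed by a toy hashing trick purely so the example runs end to end; a real system would call one of the embedding models listed below.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model call (OpenAI, Cohere, a local model).
    # This toy version hashes words into a 256-dim bag-of-words vector so the
    # example runs; it does NOT capture semantics the way a real model does.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = identical direction (similar meaning), near 0 = unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = [
    "Database credentials must be rotated every 90 days by the owning team.",
    "The cafeteria menu rotates weekly and is posted on the intranet.",
]
query = embed("How often do we rotate database credentials?")
scores = [cosine_similarity(embed(c), query) for c in chunks]
print(scores)  # the credential-rotation chunk should score higher
```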

Common options in 2026:

Model | Dimensions | Best For
text-embedding-3-small (OpenAI) | 1536 | Cost-efficient general use
text-embedding-3-large (OpenAI) | 3072 | Higher quality, higher cost
embed-english-v3.0 (Cohere) | 1024 | Strong general quality
nomic-embed-text-v1.5 (open source) | 768 | Self-hosted deployments
bge-large-en-v1.5 (BAAI, open source) | 1024 | Strong open-source baseline

A vector database stores embeddings alongside their source text and metadata, and supports fast approximate nearest-neighbor (ANN) search. ANN returns vectors close to a query vector without checking every vector in the database — this is what makes vector search fast at scale.

Most vector databases use HNSW (Hierarchical Navigable Small World) indexing, a graph-based ANN algorithm that provides excellent speed-quality tradeoffs.
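Here is a sketch of what HNSW search looks like in code, using the open-source hnswlib library with random vectors standing in for real embeddings; the parameter values are illustrative defaults, not tuned recommendations.

```python
import hnswlib
import numpy as np

dim, n = 768, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)  # stand-ins for chunk embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # build-time quality/speed knobs
index.add_items(vectors, np.arange(n))
index.set_ef(64)  # query-time recall vs. latency trade-off

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)  # approximate top-10 neighbors
```

Managed vector databases expose similar HNSW parameters under their own configuration names.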

See Vector Database Comparison for a detailed comparison of Pinecone, Weaviate, Qdrant, and Chroma.

Pure vector search retrieves semantically similar documents. It handles synonyms and paraphrasing well. But it struggles with exact matches — rare proper nouns, product codes, technical identifiers. A query for “JIRA-12345” might retrieve semantically similar tickets rather than that specific ticket.

Hybrid search combines vector search with keyword search (typically BM25, a classical term-frequency algorithm). The two result sets are merged using Reciprocal Rank Fusion (RRF) or a weighted combination. The result is significantly better recall: semantic search handles meaning, keyword search handles precision on exact terms.

Hybrid search is the default choice for production systems with heterogeneous document types.
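Reciprocal Rank Fusion itself is only a few lines of code. A minimal sketch, with illustrative document IDs:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists (e.g. one from BM25, one from vector search).

    Each document scores 1 / (k + rank) per list it appears in; k=60 is the
    constant from the original RRF paper.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc-7", "doc-2", "doc-9"]     # keyword ranking
vector_hits = ["doc-2", "doc-5", "doc-7"]   # semantic ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # doc-2 and doc-7 rise to the top
```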

RAG Retrieval Quality Stack

Each layer improves retrieval quality. Most production systems use hybrid search and reranking on top of basic vector search.

  • Basic vector search: semantic similarity — fast but misses exact matches and rare terms
  • Hybrid search (BM25 + vector): combines semantic and keyword retrieval via Reciprocal Rank Fusion — significantly better recall
  • Reranking: a cross-encoder model scores each retrieved chunk against the query — improves precision of the top-k
  • Context compression: extract only the relevant sentences from each chunk — reduces noise in the LLM's context window

The initial retrieval step returns the top-k most similar chunks (typically k=10–20) using approximate nearest-neighbor search. This is fast but approximate — the similarity scores are directional, not precise relevance scores.

Reranking applies a more expensive but more accurate model to score each retrieved chunk against the query. The reranker has access to both the query and the candidate chunk simultaneously, allowing it to model their interaction directly. The top 3–5 chunks after reranking are what actually go into the prompt.

Reranking models are typically cross-encoders — models that read both the query and the candidate document as a single input. Popular options include Cohere Rerank and the open-source cross-encoders from the sentence-transformers library. The cost is one reranker inference call per retrieved chunk — acceptable overhead for the quality improvement.
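A sketch of the reranking step using an open-source cross-encoder from sentence-transformers; the model name and candidate chunks are illustrative.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # one common open-source choice

query = "What is our policy on rotating database credentials?"
candidates = [
    "Database credentials must be rotated every 90 days by the owning team.",
    "The on-call rotation schedule is published every Monday.",
    "Credential recovery requires approval from the security team.",
]
# The cross-encoder reads (query, chunk) pairs jointly; one inference call per candidate.
scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]
top_context = reranked[:2]  # only the best chunks go into the prompt
```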


The indexing pipeline breaks down into four steps.

Step 1: Document loading

Every document type requires a different loader. PDFs require OCR or text extraction (PyMuPDF or pdfplumber for simple PDFs; Tesseract or Azure Document Intelligence for scanned or complex layouts). Web pages require HTML parsing (BeautifulSoup or Trafilatura). Databases require SQL queries. Each loader must handle encoding issues, corrupt files, and access control.

The output of this step is a list of Document objects, each containing the full text and metadata (source URL, file path, creation date, author, section headings).
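As one example, a minimal PyMuPDF loader for digital (text-based) PDFs; scanned PDFs would need an OCR path instead, and the metadata fields and file name shown are illustrative.

```python
import fitz  # PyMuPDF

def load_pdf(path: str) -> dict:
    """Extract plain text and minimal metadata from a text-based PDF."""
    doc = fitz.open(path)
    text = "\n".join(page.get_text() for page in doc)
    return {
        "text": text,
        "metadata": {"source": path, "page_count": doc.page_count},
    }

document = load_pdf("security-policy.pdf")
```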

Step 2: Chunking

Apply your chosen chunking strategy. For most production systems: use semantic chunking (paragraph or section boundaries) as the base, with a maximum chunk size of 512–1024 tokens. Store chunk metadata that preserves the parent document context — you will need this for citation and source attribution.

If using parent-child chunking: create two embeddings per logical unit — the small child chunk for retrieval, and a reference to the larger parent chunk for context delivery.
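A minimal sketch of the parent-child bookkeeping, assuming sections have already been split on semantic boundaries; the child size and data layout are illustrative.

```python
import uuid

def build_parent_child(sections: list[str], child_size: int = 100) -> tuple[list[dict], dict[str, str]]:
    """Return (child records to embed, parent lookup keyed by ID)."""
    children, parents = [], {}
    for section in sections:
        parent_id = str(uuid.uuid4())
        parents[parent_id] = section  # full section kept for context delivery
        words = section.split()
        for start in range(0, len(words), child_size):
            children.append({
                "text": " ".join(words[start:start + child_size]),  # small unit, matched by similarity
                "parent_id": parent_id,
            })
    return children, parents

# At query time: similarity-search over the children, then substitute
# parents[hit["parent_id"]] before assembling the prompt.
```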

Step 3: Embedding

Call your embedding model API (or run locally) for each chunk. This is typically the most expensive step in the indexing pipeline. For 100,000 chunks at 512 tokens each, using text-embedding-3-small at $0.02/million tokens, the cost is approximately $1.00. For 10 million chunks, it is approximately $100.

Batch your embedding calls to minimize API overhead. Most embedding APIs support batches of 100–2048 inputs per call.
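A sketch of batched embedding with the OpenAI Python client; the model name and batch size are illustrative, and the client assumes an OPENAI_API_KEY in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(texts: list[str], batch_size: int = 512) -> list[list[float]]:
    """Embed chunks in batches to avoid per-chunk request overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
        vectors.extend(item.embedding for item in resp.data)
    return vectors
```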

Step 4: Storage

Upsert the vector + metadata into your vector database. Upsert (insert-or-update) is preferable to insert because it handles re-indexing when documents change. Include enough metadata to support filtering (by document type, date, access level) and source attribution.
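A sketch using the Qdrant Python client (API details vary by client version); the collection name, payload fields, and toy data are illustrative.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # illustrative local instance

# One-time setup: the vector size must match the embedding model's dimensions.
client.create_collection(
    collection_name="kb_chunks",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
)

# In a real pipeline, `chunks` and `vectors` come from the chunking and embedding steps.
chunks = [{"text": "Rotate database credentials every 90 days.",
           "source_url": "https://wiki.example.internal/security",
           "access_level": "employee"}]
vectors = [[0.0] * 1536]  # stand-in for a real embedding

client.upsert(
    collection_name="kb_chunks",
    points=[
        models.PointStruct(id=i, vector=vec, payload=chunk)
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ],
)
```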

The retrieval pipeline runs the following steps on every user query.

Step 1: Query embedding

Embed the user’s query using the same model used for document indexing. This is critical — you cannot mix embedding models between indexing and retrieval.

Step 2: Vector search (and hybrid search)

Submit the query vector to your vector database. For hybrid search: also submit the query as a keyword search using the vector DB’s BM25 or full-text search capability. Merge results using Reciprocal Rank Fusion.

Step 3: Reranking (optional but recommended)

Submit the top-k retrieved chunks along with the original query to your reranker. Receive relevance scores. Keep the top 3–5 chunks by relevance score.

Step 4: Prompt assembly

Construct the final prompt: system instructions explaining the RAG task, the retrieved chunks (typically formatted as numbered sources), and the user’s question. Include instructions about citation format and what to do if the context does not contain the answer.

Step 5: LLM generation

Submit the assembled prompt to the LLM. The model generates an answer grounded in the retrieved context. A well-designed RAG prompt instructs the model to cite its sources by reference number and to say “I don’t have information on this” when the retrieved context is insufficient.
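A sketch of Steps 4 and 5 together, using the OpenAI chat API; the model name, prompt wording, and chunk format are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def answer(question: str, chunks: list[dict]) -> str:
    # Number the sources so the model can cite them by reference.
    sources = "\n\n".join(f"[{i + 1}] ({c['source_url']})\n{c['text']}" for i, c in enumerate(chunks))
    system = (
        "Answer the user's question using ONLY the numbered sources below. "
        "Cite sources like [1]. If the sources do not contain the answer, reply: "
        "\"I don't have information on this.\"\n\nSources:\n" + sources
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```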


The gap between a naive RAG implementation and a production-grade one is large. Most of the quality improvement comes from systematic improvements to chunking, retrieval, and evaluation.

Naive RAG vs Advanced RAG

Naive RAG: fixed chunks, vector search only, stuff-and-generate

  • Fixed-size chunking with arbitrary token boundaries
  • Single-vector semantic search only
  • Top-k chunks inserted verbatim into the prompt
  • No reranking — retrieval order determines context priority
  • Simple to implement — works well on clean single-topic documents
  • Fast to prototype and evaluate against a baseline

Advanced RAG: semantic chunks, hybrid search, reranking, evaluation loop

  • Semantic or hierarchical chunking preserves context boundaries
  • Hybrid search (BM25 + vector) improves recall significantly
  • Reranker model scores chunks for precise relevance ordering
  • Context compression removes noise before LLM ingestion
  • RAGAS evaluation pipeline measures retrieval and answer quality
  • Significantly more implementation and infrastructure complexity

Verdict: Start with naive RAG to establish a baseline. Adopt advanced techniques selectively based on where your RAGAS evaluation shows quality gaps.

Use naive RAG for prototypes, demos, and single-topic corpora with clean documents and lenient quality requirements. Use advanced RAG for production systems with heterogeneous document types, strict quality requirements, or large corpora.

HyDE (Hypothetical Document Embeddings)

Instead of embedding the user’s raw question, ask the LLM to generate a hypothetical answer first. Embed the hypothetical answer, not the question. Search with that embedding.

Why this helps: questions and answers have different linguistic structure. A question embedding is optimized to match other questions. A hypothetical answer embedding is closer in the vector space to the actual answer documents. HyDE consistently improves retrieval recall, especially for short questions that lack context.

The cost: one extra LLM call per query. The latency impact: significant if not parallelized. HyDE is appropriate when retrieval recall is the bottleneck.
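A minimal sketch of HyDE, assuming llm(), embed(), and vector_search() are thin placeholder wrappers around your model and vector DB clients (they are not real library calls).

```python
def hyde_retrieve(question: str, k: int = 10) -> list[str]:
    """HyDE: search with an embedding of a hypothetical answer, not the raw question."""
    hypothetical = llm(
        "Write a short, plausible passage that answers the question below, "
        "as if quoting internal documentation.\n\nQuestion: " + question
    )
    # The answer-shaped embedding lands closer to real answer chunks in vector space.
    return vector_search(embed(hypothetical), top_k=k)
```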

Multi-Query Retrieval

Generate 3–5 query variations from the original question using an LLM or a smaller model. Run retrieval for each variation. Merge and deduplicate the results before reranking.

Why this helps: the user’s original phrasing may not match the language used in the documents. Alternative phrasings capture different relevant chunks. Multi-query retrieval is one of the highest-leverage improvements for recall.
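A sketch of multi-query retrieval, reusing the RRF merge sketched earlier; llm(), embed(), and vector_search() are again placeholder wrappers.

```python
def multi_query_retrieve(question: str, n_variants: int = 4, k: int = 10) -> list[str]:
    """Paraphrase the question, retrieve for each variant, merge with RRF."""
    prompt = (f"Rewrite the following question in {n_variants} different ways, "
              f"one per line, preserving its meaning:\n{question}")
    variants = [question] + [v.strip() for v in llm(prompt).splitlines() if v.strip()]
    result_lists = [vector_search(embed(q), top_k=k) for q in variants]
    return reciprocal_rank_fusion(result_lists)  # merge + dedupe before reranking
```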

Contextual Retrieval (Anthropic)

Before embedding chunks, ask an LLM to prepend a brief context summary to each chunk: “This excerpt is from a Q4 2024 earnings call discussing gross margin pressure in the hardware division.” The enriched chunk is what gets embedded and stored.

Why this helps: chunks are often meaningful only in the context of their surrounding document. A chunk that says “The margin declined by 3.2%” is semantically ambiguous without knowing what document it came from. The prepended context makes the chunk self-contained for retrieval purposes.
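A sketch of the enrichment step, with llm() as a placeholder wrapper; the prompt wording is illustrative, not Anthropic's exact prompt.

```python
def contextualize_chunk(document_summary: str, chunk_text: str) -> str:
    """Prepend an LLM-written situating sentence so the chunk stands on its own."""
    context = llm(
        "In one sentence, situate this excerpt within the overall document so it "
        "can be understood in isolation.\n\n"
        f"Document summary: {document_summary}\n\n"
        f"Excerpt: {chunk_text}"
    )
    return context.strip() + "\n\n" + chunk_text  # this enriched text is what gets embedded
```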

Self-RAG and Agentic RAG

Standard RAG retrieves once per query. Agentic RAG treats retrieval as an action in a loop: retrieve, assess whether the retrieved context is sufficient, retrieve again with a refined query if not, and continue until the agent has enough information to answer.

This is more powerful for multi-hop questions but adds LLM calls, latency, and cost. See AI Agents and Agentic Systems for the agent architecture that underlies this pattern.
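A compressed sketch of the retrieve-assess-refine loop, again with llm(), embed(), and vector_search() as placeholder wrappers; real agent frameworks structure this with tool calls rather than string parsing.

```python
def agentic_answer(question: str, max_rounds: int = 3) -> str:
    """Retrieve, assess sufficiency, refine the query, repeat; each round adds an LLM call."""
    query, context = question, []
    for _ in range(max_rounds):
        context += vector_search(embed(query), top_k=5)
        verdict = llm(
            "Is the context below sufficient to answer the question? "
            "Reply SUFFICIENT, or propose a single better search query.\n\n"
            f"Question: {question}\n\nContext:\n" + "\n".join(context)
        )
        if verdict.strip().upper().startswith("SUFFICIENT"):
            break
        query = verdict.strip()  # refined query for the next round
    return llm("Answer from the context.\n\nContext:\n" + "\n".join(context)
               + f"\n\nQuestion: {question}")
```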


Example 1: Enterprise Knowledge Base Search

Corpus: 20,000 internal documents — policy PDFs, engineering RFCs, product specs, support KB articles.

Indexing configuration:

  • Loader: PyMuPDF for PDFs, HTML parser for web-based documents
  • Chunking: section-aware semantic chunking, max 800 tokens, 150-token overlap
  • Metadata: source_url, document_type, last_modified, department, access_level
  • Embedding model: text-embedding-3-large (quality requirement)
  • Vector DB: Qdrant with HNSW index, hybrid search enabled

Retrieval configuration:

  • Hybrid search: BM25 weight 0.3, vector weight 0.7
  • Initial retrieval: top-20 chunks
  • Reranker: Cohere Rerank, return top-5
  • Access control: filter by access_level at query time (critical for enterprise)

Key design decisions:

  • Access control filtering must happen at the vector DB level (metadata filter), not post-retrieval. Returning unauthorized documents to the LLM and hoping it ignores them is not a security model.
  • Section-aware chunking means each chunk maps to a complete policy section — no answer spans are split across chunks.
  • text-embedding-3-large costs 5x more than text-embedding-3-small but the quality requirement justified it for this corpus.

Example 2: Customer Support Agent with RAG


Corpus: Product documentation, known issue articles, troubleshooting guides — updated weekly.

Critical requirement: When a user asks about a bug fixed last Tuesday, the system must return the fix, not the pre-fix workaround from 6 months ago.

Solution: Incremental indexing with document-level versioning. Each document has a version and last_indexed_at metadata field. The indexing pipeline runs nightly, detects changed documents (via hash comparison or last-modified timestamp), removes the old vectors for changed documents, and re-indexes them.

This is one of the most underestimated operational requirements for production RAG systems. Static corpora are easy. Dynamic corpora require a full incremental indexing pipeline.
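A minimal sketch of the change-detection step using content hashes; the surrounding delete-and-re-index logic depends on your vector DB.

```python
import hashlib

def detect_changed_docs(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents whose content differs from what was last indexed.

    current_docs maps doc_id -> current text; indexed_hashes maps doc_id -> the
    SHA-256 digest stored at the previous indexing run.
    """
    changed = []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if indexed_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return changed

# For each changed document: delete its old vectors (filter by doc_id in the
# vector DB), re-chunk, re-embed, upsert, and store the new digest.
```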


7. Trade-offs, Limitations, and Failure Modes


The most common production RAG failure is retrieval that misses: the right answer exists in the corpus, but the chunk that contains it is never retrieved because its wording is semantically distant from the query phrasing.

Example: The user asks “How do I reset my password?” The relevant policy document uses the phrase “Account credential recovery procedure.” Fixed-size chunking may also have split the relevant paragraph across two chunks.

Mitigation: Semantic chunking on natural boundaries. Hybrid search for keyword coverage. Testing chunking decisions against a representative set of real queries before deploying.

LLMs have a documented tendency to underweight information positioned in the middle of long contexts. If you retrieve 10 chunks and the most relevant one is chunk #5, the LLM may produce a worse answer than if it were chunk #1 or chunk #10.

Mitigation: Rerank to put the most relevant chunks at the start and end of the context. Limit context to 3–5 high-quality chunks rather than stuffing 10 mediocre ones.

RAG reduces but does not eliminate hallucination. The LLM can:

  • Ignore the retrieved context and use parametric knowledge instead
  • Generate plausible-sounding details not present in the context
  • Misattribute a quote from one chunk to a different source

Mitigation: Explicit system prompt instructions to answer only from context and to cite sources by reference number. RAGAS faithfulness scoring to measure how often answers are actually grounded in the retrieved context.

If retrieved chunks plus the user query exceed the LLM’s context window, either the prompt is truncated (losing information) or the API call fails. For large retrieval sets or long documents, this is a real production issue.

Mitigation: Enforce a strict token budget for retrieved context. Reranking helps by keeping only the most relevant chunks. Context compression (extracting only the relevant sentences from each chunk) can reduce token count by 50–70%.

Most production RAG systems are deployed without a labeled evaluation set — there is no pre-existing dataset of questions with known correct answers. Evaluation must be created from scratch, which requires dedicated time and tooling.

RAGAS (Retrieval Augmented Generation Assessment) provides reference-free metrics that partially address this:

Metric | What It Measures
Faithfulness | Is the generated answer grounded in the retrieved context?
Answer Relevancy | Is the answer relevant to the question asked?
Context Recall | Did retrieval return the chunks needed to answer?
Context Precision | Are the retrieved chunks actually useful (not noisy)?

RAGAS can use an LLM-as-judge approach to score these metrics without human labels. This is not a perfect substitute for human evaluation, but it enables continuous monitoring and regression detection.
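A sketch of a RAGAS evaluation run. The imports and column names follow the older 0.1-series RAGAS API (newer releases use EvaluationDataset and SingleTurnSample, so check current documentation), the sample row is illustrative, and an API key for the judge model is assumed to be configured.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question":     ["What is our database credential rotation policy?"],
    "answer":       ["Credentials must be rotated every 90 days. [1]"],
    "contexts":     [["Database credentials must be rotated every 90 days by the owning team."]],
    "ground_truth": ["Database credentials are rotated every 90 days."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # per-metric averages; track these across releases to catch regressions
```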


RAG architecture is one of the most common technical interview topics for GenAI engineering roles in 2025–2026. Most candidates know what RAG is. Interviewers are testing whether you can go beyond the definition.

1. Can you explain why naive RAG fails in production?

Interviewers expect you to articulate specific failure modes: poor chunking bisecting context, pure vector search missing exact terms, top-k stuffing causing lost-in-the-middle degradation. Generic “it’s not accurate enough” answers are not sufficient.

2. Can you describe how to improve retrieval quality systematically?

Walk through the improvement chain: semantic chunking → hybrid search → reranking → context compression. Explain the tradeoff each adds (quality vs latency/cost).

3. Can you design an evaluation framework?

“How do you know if your RAG system is working?” The answer should include RAGAS metrics, an offline evaluation dataset (even synthetic), and production monitoring using proxy metrics (thumbs up/down, escalation rate, answer length as a faithfulness proxy).

4. Can you handle dynamic corpora?

“What happens when a document in your knowledge base changes?” The answer must cover incremental indexing — not full re-indexing from scratch every time.

Other common interview questions:

  • Explain the two-pipeline architecture of a RAG system
  • What is chunking and how do you choose a chunking strategy?
  • What is hybrid search and why is it better than pure vector search?
  • What is reranking and when would you use it?
  • How do you evaluate a RAG system?
  • Design a RAG system for a corpus of 1 million documents with sub-200ms retrieval latency
  • How do you handle documents that change frequently?
  • What is HyDE and when does it help?
  • What is the difference between RAG and fine-tuning? When do you use each?

A production RAG system serving interactive queries must complete the full retrieval-augment-generate cycle within an acceptable latency window — typically under 3–5 seconds for most applications, with streaming used to improve perceived responsiveness.

The retrieval step itself must be fast. Vector search on a well-configured HNSW index with 1 million vectors typically completes in under 50ms. Reranking adds 100–200ms for a batch of 20 candidates. LLM generation dominates — 1–3 seconds for most queries depending on output length and model.

Where production engineers spend time on latency:

  • Batch embedding calls to avoid per-chunk API overhead
  • Cache query embeddings for repeated or similar queries
  • Parallelize multi-query retrieval — all query variations run concurrently (see the sketch after this list)
  • Use streaming for LLM generation — return first tokens to the user while the rest generate
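A sketch of the parallelization point from the list above; async_search is a stub standing in for an async embed-plus-vector-search call.

```python
import asyncio

async def async_search(query: str, top_k: int = 10) -> list[str]:
    # Stub for an async embed + vector search call; sleeps to simulate network latency.
    await asyncio.sleep(0.05)
    return [f"chunk-for({query})-{i}" for i in range(top_k)]

async def retrieve_all(variants: list[str]) -> list[list[str]]:
    # All query variations run concurrently; total latency is roughly the slowest single call.
    return list(await asyncio.gather(*(async_search(q) for q in variants)))

result_lists = asyncio.run(retrieve_all([
    "How do I rotate database credentials?",
    "What is the credential rotation policy?",
]))
```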

Vector database sizing: A typical production deployment stores vectors for millions of documents. Memory requirements depend on vector dimensions. A system with 10 million 1536-dimensional float32 vectors requires approximately 60GB RAM for in-memory HNSW. Qdrant, Weaviate, and Pinecone all support quantization to reduce memory footprint at a small quality cost.

Indexing pipeline orchestration: The offline indexing pipeline needs to be treated as a first-class engineering concern. It should be monitored, have retry logic for failed embeddings, support incremental updates, and emit metrics on indexing lag (how stale the current index is relative to the source documents).

Access control: In enterprise deployments, users must only retrieve documents they are authorized to access. This requires metadata-filtered vector search — filters applied at the vector DB level before retrieval, not post-retrieval filtering. Post-retrieval filtering defeats the purpose of top-k retrieval.
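A sketch of pre-retrieval filtering with the Qdrant Python client; API details vary by version, and the collection name, payload field, and values are illustrative.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

query_vec = [0.0] * 1536                      # stand-in for the embedded user query
user_access_levels = ["public", "employee"]   # resolved from the caller's identity

hits = client.search(
    collection_name="kb_chunks",
    query_vector=query_vec,
    query_filter=models.Filter(must=[          # applied inside the ANN search, not afterwards
        models.FieldCondition(
            key="access_level",
            match=models.MatchAny(any=user_access_levels),
        ),
    ]),
    limit=20,
)
```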

Standard application metrics (p50/p99 latency, error rate) are necessary but insufficient for RAG systems. Additional metrics that matter in production:

  • Retrieval recall proxy: What fraction of answers include a citation from the retrieved chunks vs generating without citation?
  • No-answer rate: How often does the system respond with “I don’t have information on this”? Unexpected spikes indicate retrieval failures.
  • Context token utilization: Average tokens consumed by retrieved context. A sudden drop may indicate retrieval quality degradation.
  • User feedback signals: Thumbs up/down, follow-up clarification rate, escalation to human rate

RAG is two pipelines sharing a vector database. The offline indexing pipeline transforms documents into searchable vectors. The online retrieval pipeline finds the relevant vectors for each query and injects the source text into the LLM’s prompt. Every quality improvement comes from improving one of these pipelines or the interface between them.

Decision Point | Recommendation
Chunking strategy | Semantic/section-aware, not fixed-size
Chunk size | 512–1024 tokens with 10–20% overlap
Embedding model | Match quality to requirements; benchmark on your corpus
Search strategy | Hybrid (BM25 + vector) as default
Reranking | Yes for any corpus over 50K chunks
Access control | Metadata filter at the DB level, never post-retrieval
Evaluation | RAGAS metrics from day one
Dynamic corpora | Incremental indexing pipeline
Scenario | RAG Appropriate?
Private/internal knowledge base | Yes — core use case
Frequently updated information | Yes — no retraining required
Need source citations | Yes — retrieval provides grounding
Need to change behavior/tone | No — consider fine-tuning or prompt engineering
Knowledge is static and small | Maybe — consider in-context stuffing instead
Need factual recall of common knowledge | No — model parametric knowledge is sufficient


Last updated: February 2026. RAG framework APIs evolve rapidly; verify specifics against current documentation.