RAG Architecture Guide — Retrieval-Augmented Generation for Production
1. Introduction and Motivation
Why RAG Exists
Large language models are trained on a static snapshot of the internet with a hard knowledge cutoff. Once training ends, they cannot update what they know. They have no access to your private documents, internal wikis, customer data, real-time information, or anything that did not exist in their training corpus.
This creates a fundamental problem for production applications. A customer service bot needs to know your current product catalog. A legal tool needs to reference the specific contract at hand. An enterprise search system needs to surface internal documentation that was written last week. The LLM’s parametric knowledge — the knowledge baked into its weights — cannot satisfy these requirements.
The obvious solution is to supply the missing information directly. Include the relevant documents in the prompt. Let the LLM reason over that context rather than trying to recall something it was never trained on. This is the core idea behind Retrieval-Augmented Generation: at query time, retrieve the relevant documents from a knowledge base and inject them into the prompt before the LLM responds.
The retrieval step is what makes this practical at scale. You cannot fit an entire enterprise document corpus into a single prompt — context windows are finite and LLM costs scale with token count. Retrieval narrows the corpus to only the documents most relevant to the current query, keeping prompts manageable and costs bounded.
RAG is the dominant architecture for knowledge-intensive LLM applications because it solves the knowledge currency problem without requiring expensive model retraining, and it provides a form of grounding — the LLM’s answer is traceable back to specific source documents.
What You Will Learn
This guide covers:
- The two-pipeline architecture of every RAG system: indexing and retrieval
- Chunking strategies and why they matter more than most engineers expect
- How embedding models and vector databases enable semantic search
- Hybrid search, reranking, and the retrieval quality improvement chain
- Advanced patterns including HyDE, multi-query retrieval, and contextual retrieval
- How to evaluate a RAG system using RAGAS metrics
- Where RAG fails in production and how to design against those failures
- What interviewers expect when discussing RAG architecture
2. Real-World Problem Context
The Document Q&A Problem at Scale
Consider a company with 50,000 internal documents: engineering RFCs, HR policies, product specs, customer contracts, meeting notes, support tickets. A new engineer asks: “What is our company’s policy on rotating database credentials?” Searching manually through a 50,000-document corpus is infeasible. A keyword search returns 300 documents, most irrelevant. A traditional search engine returns results ranked by keyword overlap, not by semantic relevance to the question.
This is exactly the problem RAG solves. The offline indexing pipeline processes all 50,000 documents, chunks them, embeds them, and stores them in a vector database. At query time, the engineer’s question is embedded and the vector database returns the top 5 most semantically similar chunks — likely the security policy section that describes credential rotation requirements. Those chunks are injected into the prompt, and the LLM generates a specific, grounded answer.
This is not a contrived example. Enterprise search was one of the first RAG applications deployed at scale, starting in 2023, and remains one of the most common production use cases.
Why This Is Harder Than It Looks
The demo version of RAG is straightforward: load a PDF, chunk it into fixed-size paragraphs, embed them, store in a local vector DB, run a similarity search on the user’s question, stuff the results into a prompt. This works well for demos on clean, single-topic documents.
Production RAG fails in ways the demo never encounters:
- Heterogeneous documents: A mix of PDFs, HTML pages, Excel files, email threads, database records, and code files — each requiring different preprocessing
- Document quality variance: Scanned PDFs with poor OCR, documents with heavy table/image content, deeply nested HTML that produces garbage text after extraction
- Poor retrieval recall: The right document exists but is not retrieved because the query phrasing is different from the document’s language
- Poor context precision: Retrieval returns 5 chunks, but 4 are tangentially related noise that confuses the LLM and degrades answer quality
- Multi-hop questions: “What changed between the 2023 and 2024 versions of our security policy?” requires retrieving from two specific documents and comparing them
- Latency at scale: A vector search that completes in 50ms for 10,000 documents may take 300ms for 10 million
Real RAG engineering is mostly about fixing these production failures systematically.
3. Core Concepts and Mental Model
The Two-Pipeline Architecture
Every RAG system has two distinct execution paths that must be understood and optimized independently:
The indexing pipeline runs offline, asynchronously, when your knowledge base changes. It transforms raw documents into a form that enables fast, relevant retrieval at query time. It is batch-oriented, latency-tolerant, and can be expensive computationally.
The retrieval pipeline runs online, synchronously, on every user query. It must be fast — typically under 200ms for the retrieval step alone. It is the user-facing critical path.
These two pipelines share a critical interface: the vector database. The indexing pipeline writes vectors into it; the retrieval pipeline reads from it. Every decision about chunking, embedding, and indexing made offline directly determines retrieval quality online.
📊 Visual Explanation: RAG System Architecture — Two Pipelines
The offline indexing pipeline prepares your knowledge base. The online retrieval pipeline serves every query. Both connect through the vector store.
Chunking
Chunking is the process of splitting documents into retrievable units. The vector database indexes chunks, not whole documents. Retrieval returns chunks, not whole documents. The granularity of your chunks determines both what can be retrieved and what context the LLM receives.
This decision has more impact on RAG quality than most engineers expect. Common chunking strategies:
Fixed-size chunking: Split every N tokens with an overlap of M tokens. Simple, fast, and predictable. The overlap prevents context loss at chunk boundaries. The main weakness: splits can bisect sentences, paragraphs, or conceptual units arbitrarily, creating chunks that make no sense in isolation.
Semantic chunking: Split on natural document boundaries — paragraphs, sections, sentences — rather than token count. Produces chunks that are semantically coherent. More expensive to compute. The main weakness: chunk sizes vary widely, making it harder to reason about context window usage.
Hierarchical chunking (parent-child): Index small chunks for precise retrieval but return larger parent chunks to the LLM for context. The small chunk gets matched by similarity; the large chunk provides the surrounding context needed for a coherent answer. This is the most effective approach for most production systems.
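As a concrete reference point, here is a minimal fixed-size chunker with overlap. It is a sketch, not a library API: whitespace splitting stands in for a real tokenizer (production code would typically count tokens with something like tiktoken), and the size and overlap values are illustrative.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into roughly chunk_size-token chunks, sharing `overlap` tokens between neighbors."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```

A semantic chunker replaces the fixed window with splits on paragraph or heading boundaries; a hierarchical chunker additionally keeps a mapping from each small chunk back to its parent section (sketched in the indexing pipeline below).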
Embedding Models
An embedding model converts a chunk of text into a dense vector — a list of hundreds or thousands of floating-point numbers that encodes the text’s semantic meaning. Chunks with similar meaning have vectors that are close in the high-dimensional space; chunks with different meanings have vectors that are far apart.
The quality of your embedding model is the quality ceiling for your retrieval. A retrieval system that uses a weak embedding model cannot be fixed by improving the vector search algorithm — the signal is lost before search begins.
Common options in 2026:
| Model | Dimensions | Best For |
|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Cost-efficient general use |
| text-embedding-3-large (OpenAI) | 3072 | Higher quality, higher cost |
| embed-english-v3.0 (Cohere) | 1024 | Strong general quality |
| nomic-embed-text-v1.5 (open source) | 768 | Self-hosted deployments |
| bge-large-en-v1.5 (BAAI, open source) | 1024 | Strong open-source baseline |
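To make the vector idea concrete, here is a minimal sketch assuming the official OpenAI Python SDK, an API key in the environment, and the text-embedding-3-small model from the table; cosine similarity is the usual closeness measure.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Return an array of shape (len(texts), dimensions) with one vector per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = embed([
    "How do I rotate database credentials?",
    "Credential rotation is required every 90 days for production databases.",
])
print(cosine_similarity(vectors[0], vectors[1]))  # related texts score noticeably higher than unrelated ones
```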
Vector Databases
A vector database stores embeddings alongside their source text and metadata, and supports fast approximate nearest-neighbor (ANN) search. ANN returns vectors close to a query vector without checking every vector in the database — this is what makes vector search fast at scale.
Most vector databases use HNSW (Hierarchical Navigable Small World) indexing, a graph-based ANN algorithm that provides excellent speed-quality tradeoffs.
See Vector Database Comparison for a detailed comparison of Pinecone, Weaviate, Qdrant, and Chroma.
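Conceptually, retrieval is "find the k closest vectors to the query vector." The sketch below does this exactly with a brute-force scan; a vector database replaces the scan with an HNSW index so latency stays low at millions of vectors.

```python
import numpy as np

def top_k_exact(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> list[int]:
    """Exact cosine-similarity search over an (n_chunks, dims) matrix of embeddings.
    Returns the row indices of the k most similar chunks."""
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = index_norm @ query_norm
    return np.argsort(-scores)[:k].tolist()
```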
Hybrid Search
Pure vector search retrieves semantically similar documents. It handles synonyms and paraphrasing well. But it struggles with exact matches — rare proper nouns, product codes, technical identifiers. A query for “JIRA-12345” might retrieve semantically similar tickets rather than that specific ticket.
Hybrid search combines vector search with keyword search (typically BM25, a classical term-frequency algorithm). The two result sets are merged using Reciprocal Rank Fusion (RRF) or a weighted combination. The result is significantly better recall: semantic search handles meaning, keyword search handles precision on exact terms.
Hybrid search is the default choice for production systems with heterogeneous document types.
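Reciprocal Rank Fusion is simple enough to show in full. A minimal sketch, assuming each retriever returns an ordered list of document IDs; k=60 is the conventional smoothing constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists (e.g. BM25 results and vector results).
    Each document earns 1 / (k + rank) per list; summed scores set the final order."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "doc-jira-12345" survives the merge even though vector search missed it entirely.
fused = reciprocal_rank_fusion([
    ["doc-jira-12345", "doc-841", "doc-512"],  # BM25 (keyword) results
    ["doc-203", "doc-841", "doc-090"],         # vector search results
])
```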
📊 Visual Explanation: RAG Retrieval Quality Stack
Each layer improves retrieval quality. Most production systems use hybrid search and reranking on top of basic vector search.
Reranking
The initial retrieval step returns the top-k most similar chunks (typically k=10–20) using approximate nearest-neighbor search. This is fast but approximate — the similarity scores are directional, not precise relevance scores.
Reranking applies a more expensive but more accurate model to score each retrieved chunk against the query. The reranker has access to both the query and the candidate chunk simultaneously, allowing it to model their interaction directly. The top 3–5 chunks after reranking are what actually go into the prompt.
Reranking models are typically cross-encoders — models that read both the query and the candidate document as a single input. Popular options include Cohere Rerank and open-source cross-encoders from sentence-transformers. The cost is one reranker inference call per retrieved chunk — acceptable overhead for the quality improvement.
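A minimal reranking sketch using an open-source cross-encoder from sentence-transformers; the model name is one commonly used checkpoint, and a hosted reranker such as Cohere Rerank would slot into the same position.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # commonly used open-source reranker

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair jointly and keep the top_n highest-scoring chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```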
4. Step-by-Step Explanation
Building the Indexing Pipeline
Section titled “Building the Indexing Pipeline”Step 1: Document loading
Every document type requires a different loader. PDFs require OCR or text extraction (PyMuPDF or pdfplumber for simple PDFs; Tesseract or Azure Document Intelligence for scanned or complex layouts). Web pages require HTML parsing (BeautifulSoup or Trafilatura). Databases require SQL queries. Each loader must handle encoding issues, corrupt files, and access control.
The output of this step is a list of Document objects, each containing the full text and metadata (source URL, file path, creation date, author, section headings).
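A minimal loader sketch for digital PDFs using PyMuPDF (imported as `fitz`); the `Document` dataclass here is illustrative rather than a specific framework type, and scanned PDFs would need an OCR path instead.

```python
from dataclasses import dataclass, field
import fitz  # PyMuPDF

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def load_pdf(path: str) -> Document:
    """Extract plain text from a digital PDF and keep basic provenance metadata."""
    with fitz.open(path) as pdf:
        text = "\n".join(page.get_text() for page in pdf)
        metadata = {"source": path, "page_count": pdf.page_count}
    return Document(text=text, metadata=metadata)
```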
Step 2: Chunking
Apply your chosen chunking strategy. For most production systems: use semantic chunking (paragraph or section boundaries) as the base, with a maximum chunk size of 512–1024 tokens. Store chunk metadata that preserves the parent document context — you will need this for citation and source attribution.
If using parent-child chunking: create two embeddings per logical unit — the small child chunk for retrieval, and a reference to the larger parent chunk for context delivery.
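A sketch of the parent-child bookkeeping, assuming parents are document sections and children are fixed-size splits of each section. Only the child chunks are embedded; the parent text is looked up by `parent_id` when assembling the prompt.

```python
def build_parent_child_index(sections: list[str], child_size: int = 200):
    """Return (children_to_embed, parent_lookup). Each child records its parent_id so
    retrieval can swap a matched child for its larger parent section."""
    children, parents = [], {}
    for parent_id, section in enumerate(sections):
        parents[parent_id] = section
        words = section.split()  # stand-in for a real tokenizer
        for start in range(0, len(words), child_size):
            children.append({
                "text": " ".join(words[start:start + child_size]),
                "parent_id": parent_id,
            })
    return children, parents
```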
Step 3: Embedding
Call your embedding model API (or run locally) for each chunk. This is typically the most expensive step in the indexing pipeline. For 100,000 chunks at 512 tokens each, using text-embedding-3-small at $0.02/million tokens, the cost is approximately $1.00. For 10 million chunks, it is approximately $100.
Batch your embedding calls to minimize API overhead. Most embedding APIs support batches of 100–2048 inputs per call.
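A batched-embedding sketch, again assuming the OpenAI Python SDK; the batch size is illustrative, and production code would add retry logic around each call.

```python
from openai import OpenAI

client = OpenAI()

def embed_corpus(chunks: list[str], model: str = "text-embedding-3-small",
                 batch_size: int = 512) -> list[list[float]]:
    """Embed chunks in batches to amortize per-request overhead."""
    vectors: list[list[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```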
Step 4: Storage
Upsert the vector + metadata into your vector database. Upsert (insert-or-update) is preferable to insert because it handles re-indexing when documents change. Include enough metadata to support filtering (by document type, date, access level) and source attribution.
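An upsert sketch using qdrant-client (Qdrant appears again in the examples below); the collection name, payload fields, and local URL are assumptions, and exact client calls may differ across versions.

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")  # illustrative local instance

# One-time collection setup; size must match the embedding model's dimensions.
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def upsert_chunks(chunks: list[dict], vectors: list[list[float]]) -> None:
    """Upsert vectors together with the metadata needed later for filtering and citation."""
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=vector,
            payload={
                "text": chunk["text"],
                "source_url": chunk["source_url"],
                "document_type": chunk["document_type"],
                "access_level": chunk["access_level"],
            },
        )
        for chunk, vector in zip(chunks, vectors)
    ]
    client.upsert(collection_name="docs", points=points)
```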
Building the Retrieval Pipeline
Step 1: Query embedding
Embed the user’s query using the same model used for document indexing. This is critical — you cannot mix embedding models between indexing and retrieval.
Step 2: Vector search (and hybrid search)
Submit the query vector to your vector database. For hybrid search: also submit the query as a keyword search using the vector DB’s BM25 or full-text search capability. Merge results using Reciprocal Rank Fusion.
Step 3: Reranking (optional but recommended)
Submit the top-k retrieved chunks along with the original query to your reranker. Receive relevance scores. Keep the top 3–5 chunks by relevance score.
Step 4: Prompt assembly
Construct the final prompt: system instructions explaining the RAG task, the retrieved chunks (typically formatted as numbered sources), and the user’s question. Include instructions about citation format and what to do if the context does not contain the answer.
Step 5: LLM generation
Submit the assembled prompt to the LLM. The model generates an answer grounded in the retrieved context. A well-designed RAG prompt instructs the model to cite its sources by reference number and to say “I don’t have information on this” when the retrieved context is insufficient.
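A sketch of steps 4 and 5 together, assuming the OpenAI chat API; the system prompt wording and model name are illustrative, and each chunk is assumed to carry `text` and `source_url` fields from the metadata stored at indexing time.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Answer the user's question using only the numbered sources provided. "
    "Cite sources by number, e.g. [1]. If the sources do not contain the answer, "
    "say \"I don't have information on this.\""
)

def answer(question: str, chunks: list[dict]) -> str:
    """Assemble the RAG prompt from the reranked chunks and generate a grounded answer."""
    sources = "\n\n".join(
        f"[{i}] ({chunk['source_url']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Sources:\n{sources}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```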
5. Architecture and System View
Naive RAG vs Advanced RAG
The gap between a naive RAG implementation and a production-grade one is large. Most of the quality improvement comes from systematic improvements to chunking, retrieval, and evaluation.
📊 Visual Explanation: Naive RAG vs Advanced RAG
Naive RAG:
- Fixed-size chunking with arbitrary token boundaries
- Single-vector semantic search only
- Top-k chunks inserted verbatim into prompt
- No reranking — retrieval order determines context priority
- Simple to implement — works well on clean single-topic documents
- Fast to prototype and evaluate against a baseline
Advanced RAG:
- Semantic or hierarchical chunking preserves context boundaries
- Hybrid search (BM25 + vector) improves recall significantly
- Reranker model scores chunks for precise relevance ordering
- Context compression removes noise before LLM ingestion
- RAGAS evaluation pipeline measures retrieval and answer quality
- Significantly more implementation and infrastructure complexity
Advanced Retrieval Patterns
HyDE (Hypothetical Document Embeddings)
Instead of embedding the user’s raw question, ask the LLM to generate a hypothetical answer first. Embed the hypothetical answer, not the question. Search with that embedding.
Why this helps: questions and answers have different linguistic structure. A question embedding is optimized to match other questions. A hypothetical answer embedding is closer in the vector space to the actual answer documents. HyDE consistently improves retrieval recall, especially for short questions that lack context.
The cost: one extra LLM call per query. The latency impact: significant if not parallelized. HyDE is appropriate when retrieval recall is the bottleneck.
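A HyDE sketch under the same SDK assumption as earlier examples; the drafting prompt and model names are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(question: str) -> list[float]:
    """Draft a short hypothetical answer, then embed the draft instead of the raw question."""
    draft = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{
            "role": "user",
            "content": f"Write a short, plausible passage that answers this question:\n{question}",
        }],
    ).choices[0].message.content
    embedding = client.embeddings.create(model="text-embedding-3-small", input=[draft])
    return embedding.data[0].embedding  # search the vector DB with this vector
```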
Multi-Query Retrieval
Generate 3–5 query variations from the original question using an LLM or a smaller model. Run retrieval for each variation. Merge and deduplicate the results before reranking.
Why this helps: the user’s original phrasing may not match the language used in the documents. Alternative phrasings capture different relevant chunks. Multi-query retrieval is one of the highest-leverage improvements for recall.
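A multi-query sketch; `search(query, k)` is a hypothetical stand-in for the vector or hybrid search call, and the rewriting prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI()

def multi_query_retrieve(question: str, search, n_variations: int = 3, k: int = 10) -> list[str]:
    """Generate alternative phrasings, retrieve for each, and deduplicate before reranking."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{
            "role": "user",
            "content": f"Rewrite this question {n_variations} different ways, one per line:\n{question}",
        }],
    ).choices[0].message.content
    queries = [question] + [line.strip() for line in reply.splitlines() if line.strip()]

    seen, merged = set(), []
    for query in queries:
        for chunk_id in search(query, k):   # hypothetical retrieval helper
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged
```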
Contextual Retrieval (Anthropic)
Before embedding chunks, ask an LLM to prepend a brief context summary to each chunk: “This excerpt is from a Q4 2024 earnings call discussing gross margin pressure in the hardware division.” The enriched chunk is what gets embedded and stored.
Why this helps: chunks are often meaningful only in the context of their surrounding document. A chunk that says “The margin declined by 3.2%” is semantically ambiguous without knowing what document it came from. The prepended context makes the chunk self-contained for retrieval purposes.
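A contextual-retrieval sketch; the situating prompt loosely follows the description above and is an assumption, not Anthropic's exact wording, and truncating the document for the context call is a simplification.

```python
from openai import OpenAI

client = OpenAI()

def contextualize_chunk(document_text: str, chunk_text: str) -> str:
    """Prepend a one-sentence situating context to the chunk before it is embedded."""
    prompt = (
        "Here is a document:\n" + document_text[:8000] +   # crude truncation for long documents
        "\n\nHere is a chunk from it:\n" + chunk_text +
        "\n\nWrite one sentence situating this chunk within the document."
    )
    context = client.chat.completions.create(
        model="gpt-4o-mini",  # a small, cheap model is typical for this enrichment step
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return f"{context}\n{chunk_text}"  # this enriched text is what gets embedded and stored
```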
Self-RAG and Agentic RAG
Standard RAG retrieves once per query. Agentic RAG treats retrieval as an action in a loop: retrieve, assess whether the retrieved context is sufficient, retrieve again with a refined query if not, and continue until the agent has enough information to answer.
This is more powerful for multi-hop questions but adds LLM calls, latency, and cost. See AI Agents and Agentic Systems for the agent architecture that underlies this pattern.
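The loop itself is small; the judgment lives in the helpers. In this sketch, `search`, `is_sufficient`, `refine_query`, and `generate_answer` are all hypothetical stand-ins (the middle two would typically be LLM calls).

```python
def agentic_rag(question: str, search, is_sufficient, refine_query, generate_answer,
                max_rounds: int = 3) -> str:
    """Retrieve, assess, refine, repeat; stop when the context is judged sufficient
    or the round budget runs out, then answer from whatever was gathered."""
    query, context = question, []
    for _ in range(max_rounds):
        context.extend(search(query, k=5))
        if is_sufficient(question, context):
            break
        query = refine_query(question, context)
    return generate_answer(question, context)
```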
6. Practical Examples
Example 1: Enterprise Document Q&A System
Corpus: 20,000 internal documents — policy PDFs, engineering RFCs, product specs, support KB articles.
Indexing configuration:
- Loader: PyMuPDF for PDFs, HTML parser for web-based documents
- Chunking: section-aware semantic chunking, max 800 tokens, 150-token overlap
- Metadata: `source_url`, `document_type`, `last_modified`, `department`, `access_level`
- Embedding model: `text-embedding-3-large` (quality requirement)
- Vector DB: Qdrant with HNSW index, hybrid search enabled
Retrieval configuration:
- Hybrid search: BM25 weight 0.3, vector weight 0.7
- Initial retrieval: top-20 chunks
- Reranker: Cohere Rerank, return top-5
- Access control: filter by `access_level` at query time (critical for enterprise)
Key design decisions:
- Access control filtering must happen at the vector DB level (metadata filter), not post-retrieval (see the filter sketch after this list). Returning unauthorized documents to the LLM and hoping it ignores them is not a security model.
- Section-aware chunking means each chunk maps to a complete policy section — no answer spans are split across chunks.
- `text-embedding-3-large` costs 5x more than `text-embedding-3-small`, but the quality requirement justified it for this corpus.
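A sketch of the query-time access filter with qdrant-client, matching the `access_level` metadata above; the exact client call may differ across qdrant-client versions, so treat it as illustrative.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny

client = QdrantClient(url="http://localhost:6333")  # illustrative

def search_with_acl(query_vector: list[float], allowed_levels: list[str], k: int = 20):
    """Apply the access_level filter inside the vector DB so unauthorized chunks
    never enter the candidate set."""
    return client.search(
        collection_name="docs",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="access_level", match=MatchAny(any=allowed_levels))]
        ),
        limit=k,
    )
```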
Example 2: Customer Support Agent with RAG
Corpus: Product documentation, known issue articles, troubleshooting guides — updated weekly.
Critical requirement: When a user asks about a bug fixed last Tuesday, the system must return the fix, not the pre-fix workaround from 6 months ago.
Solution: Incremental indexing with document-level versioning. Each document carries `version` and `last_indexed_at` metadata fields. The indexing pipeline runs nightly, detects changed documents (via hash comparison or last-modified timestamp), removes the old vectors for changed documents, and re-indexes them.
This is one of the most underestimated operational requirements for production RAG systems. Static corpora are easy. Dynamic corpora require a full incremental indexing pipeline.
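A minimal sketch of hash-based change detection; `delete_vectors_for(doc_id)` and `index_document(doc)` are hypothetical stand-ins for the vector-DB delete and the chunk/embed/upsert steps.

```python
import hashlib

def incremental_reindex(documents: list[dict], stored_hashes: dict[str, str],
                        delete_vectors_for, index_document) -> dict[str, str]:
    """Re-index only documents whose content hash changed since the last run."""
    new_hashes = dict(stored_hashes)
    for doc in documents:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if stored_hashes.get(doc["id"]) != digest:
            delete_vectors_for(doc["id"])   # drop stale vectors for the old version
            index_document(doc)             # chunk, embed, and upsert the new version
            new_hashes[doc["id"]] = digest
    return new_hashes
```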
7. Trade-offs, Limitations, and Failure Modes
The Chunking Failure Mode
The most common production RAG failure: the right answer exists in the corpus but is never retrieved, because the chunk that contains it is phrased very differently from the query.
Example: The user asks “How do I reset my password?” The relevant policy document uses the phrase “Account credential recovery procedure.” Fixed-size chunking may also have split the relevant paragraph across two chunks.
Mitigation: Semantic chunking on natural boundaries. Hybrid search for keyword coverage. Testing chunking decisions against a representative set of real queries before deploying.
The Lost-in-the-Middle Problem
LLMs have a documented tendency to underweight information positioned in the middle of long contexts. If you retrieve 10 chunks and the most relevant one is chunk #5, the LLM may produce a worse answer than if it were chunk #1 or chunk #10.
Mitigation: Rerank to put the most relevant chunks at the start and end of the context. Limit context to 3–5 high-quality chunks rather than stuffing 10 mediocre ones.
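One simple way to implement the reordering, assuming the chunks arrive best-first from the reranker:

```python
def reorder_for_position_bias(chunks_best_first: list[str]) -> list[str]:
    """Alternate the best chunks between the start and the end of the context,
    so the weakest chunks land in the middle where they are most likely to be ignored."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ["c1", "c2", "c3", "c4", "c5"] -> ["c1", "c3", "c5", "c4", "c2"]
```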
Hallucination in RAG Systems
RAG reduces but does not eliminate hallucination. The LLM can:
- Ignore the retrieved context and use parametric knowledge instead
- Generate plausible-sounding details not present in the context
- Misattribute a quote from one chunk to a different source
Mitigation: Explicit system prompt instructions to answer only from context and to cite sources by reference number. RAGAS faithfulness scoring to measure how often answers are actually grounded in the retrieved context.
Context Window Overflow
If retrieved chunks plus the user query exceed the LLM’s context window, either the prompt is truncated (losing information) or the API call fails. For large retrieval sets or long documents, this is a real production issue.
Mitigation: Enforce a strict token budget for retrieved context. Reranking helps by keeping only the most relevant chunks. Context compression (extracting only the relevant sentences from each chunk) can reduce token count by 50–70%.
Evaluation Without Ground Truth
Most production RAG systems are deployed without a labeled evaluation set — there is no pre-existing dataset of questions with known correct answers. Evaluation must be created from scratch, which requires dedicated time and tooling.
RAGAS (Retrieval Augmented Generation Assessment) provides reference-free metrics that partially address this:
| Metric | What It Measures |
|---|---|
| Faithfulness | Is the generated answer grounded in the retrieved context? |
| Answer Relevancy | Is the answer relevant to the question asked? |
| Context Recall | Did retrieval return the chunks needed to answer? |
| Context Precision | Are the retrieved chunks actually useful (not noisy)? |
RAGAS can use an LLM-as-judge approach to score these metrics without human labels. This is not a perfect substitute for human evaluation, but it enables continuous monitoring and regression detection.
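A minimal evaluation sketch following the classic RAGAS `evaluate()` interface; RAGAS evolves quickly, so metric and column names may differ in current releases and should be checked against the documentation. The single row of data is purely illustrative.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Illustrative single-row evaluation set; real sets come from logged queries plus curated
# or synthetically generated reference answers.
eval_data = Dataset.from_dict({
    "question": ["What is our policy on rotating database credentials?"],
    "answer": ["Production database credentials must be rotated every 90 days."],
    "contexts": [["Security policy 4.2: production credentials are rotated every 90 days."]],
    "ground_truth": ["Credentials are rotated every 90 days."],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)  # per-metric scores suitable for dashboards and regression checks
```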
8. Interview Perspective
What Interviewers Are Assessing
RAG architecture is one of the most common technical interview topics for GenAI engineering roles in 2025–2026. Most candidates know what RAG is. Interviewers are testing whether you can go beyond the definition.
1. Can you explain why naive RAG fails in production?
Interviewers expect you to articulate specific failure modes: poor chunking bisecting context, pure vector search missing exact terms, top-k stuffing causing lost-in-the-middle degradation. Generic “it’s not accurate enough” answers are not sufficient.
2. Can you describe how to improve retrieval quality systematically?
Walk through the improvement chain: semantic chunking → hybrid search → reranking → context compression. Explain the tradeoff each adds (quality vs latency/cost).
3. Can you design an evaluation framework?
“How do you know if your RAG system is working?” The answer should include RAGAS metrics, an offline evaluation dataset (even synthetic), and production monitoring using proxy metrics (thumbs up/down, escalation rate, answer length as a faithfulness proxy).
4. Can you handle dynamic corpora?
“What happens when a document in your knowledge base changes?” The answer must cover incremental indexing — not full re-indexing from scratch every time.
Common Interview Questions on RAG
- Explain the two-pipeline architecture of a RAG system
- What is chunking and how do you choose a chunking strategy?
- What is hybrid search and why is it better than pure vector search?
- What is reranking and when would you use it?
- How do you evaluate a RAG system?
- Design a RAG system for a corpus of 1 million documents with sub-200ms retrieval latency
- How do you handle documents that change frequently?
- What is HyDE and when does it help?
- What is the difference between RAG and fine-tuning? When do you use each?
9. Production Perspective
Latency Budget
A production RAG system serving interactive queries must complete the full retrieval-augment-generate cycle within an acceptable latency window — typically under 3–5 seconds for most applications, with streaming used to improve perceived responsiveness.
The retrieval step itself must be fast. Vector search on a well-configured HNSW index with 1 million vectors typically completes in under 50ms. Reranking adds 100–200ms for a batch of 20 candidates. LLM generation dominates — 1–3 seconds for most queries depending on output length and model.
Where production engineers spend time on latency:
- Batch embedding calls to avoid per-chunk API overhead
- Cache query embeddings for repeated or similar queries
- Parallelize multi-query retrieval — all query variations run concurrently (see the sketch after this list)
- Use streaming for LLM generation — return first tokens to the user while the rest generate
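For the parallelization point, a small sketch using a thread pool (the search call is network-bound, so threads are enough); `search(query, k)` is a hypothetical stand-in for the vector or hybrid search call.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_retrieve(queries: list[str], search, k: int = 10) -> list[list[str]]:
    """Run all query variations concurrently instead of sequentially."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        return list(pool.map(lambda query: search(query, k), queries))
```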
Infrastructure Considerations
Vector database sizing: A typical production deployment stores vectors for millions of documents. Memory requirements depend on vector dimensions. A system with 10 million 1536-dimensional float32 vectors requires approximately 60GB RAM for in-memory HNSW. Qdrant, Weaviate, and Pinecone all support quantization to reduce memory footprint at a small quality cost.
Indexing pipeline orchestration: The offline indexing pipeline needs to be treated as a first-class engineering concern. It should be monitored, have retry logic for failed embeddings, support incremental updates, and emit metrics on indexing lag (how stale is the current index relative to the source documents).
Access control: In enterprise deployments, users must only retrieve documents they are authorized to access. This requires metadata-filtered vector search — filters applied at the vector DB level before retrieval, not post-retrieval filtering. Post-retrieval filtering defeats the purpose of top-k retrieval.
Monitoring What Matters
Standard application metrics (p50/p99 latency, error rate) are necessary but insufficient for RAG systems. Additional metrics that matter in production:
- Retrieval recall proxy: What fraction of answers include a citation from the retrieved chunks vs generating without citation?
- No-answer rate: How often does the system respond with “I don’t have information on this”? Unexpected spikes indicate retrieval failures.
- Context token utilization: Average tokens consumed by retrieved context. A sudden drop may indicate retrieval quality degradation.
- User feedback signals: Thumbs up/down, follow-up clarification rate, escalation to human rate
10. Summary and Key Takeaways
The Core Mental Model
RAG is two pipelines sharing a vector database. The offline indexing pipeline transforms documents into searchable vectors. The online retrieval pipeline finds the relevant vectors for each query and injects the source text into the LLM’s prompt. Every quality improvement comes from improving one of these pipelines or the interface between them.
RAG Design Checklist
| Decision Point | Recommendation |
|---|---|
| Chunking strategy | Semantic/section-aware, not fixed-size |
| Chunk size | 512–1024 tokens with 10–20% overlap |
| Embedding model | Match quality to requirements; benchmark on your corpus |
| Search strategy | Hybrid (BM25 + vector) as default |
| Reranking | Yes for any corpus over 50K chunks |
| Access control | Metadata filter at the DB level, never post-retrieval |
| Evaluation | RAGAS metrics from day one |
| Dynamic corpora | Incremental indexing pipeline |
When RAG Is the Right Choice
| Scenario | RAG Appropriate? |
|---|---|
| Private/internal knowledge base | Yes — core use case |
| Frequently updated information | Yes — no retraining required |
| Need source citations | Yes — retrieval provides grounding |
| Need to change behavior/tone | No — consider fine-tuning or prompt engineering |
| Knowledge is static and small | Maybe — consider in-context stuffing instead |
| Need factual recall of common knowledge | No — model parametric knowledge is sufficient |
Official Documentation and Further Reading
Frameworks:
- LangChain RAG Documentation — Comprehensive RAG tutorials and integrations
- LlamaIndex Documentation — RAG-first framework with advanced retrieval modules
- Haystack — Production-focused RAG framework from Deepset
Evaluation:
- RAGAS Documentation — Reference-free RAG evaluation metrics
- Anthropic Contextual Retrieval — Anthropic’s research on context-aware chunking
Research:
- RAG Survey (Gao et al., 2023) — Comprehensive survey of RAG techniques
Related
- Vector Database Comparison — Pinecone vs Weaviate vs Qdrant vs Chroma for production RAG
- Fine-Tuning vs RAG — When to use each approach and how to combine them
- AI Agents and Agentic Systems — How Agentic RAG uses the ReAct loop for iterative retrieval
- Prompt Engineering — How to write the RAG system prompt that instructs the LLM to use context correctly
- Essential GenAI Tools — The full production tool stack including vector DBs and observability
- GenAI Interview Questions — Practice questions on RAG architecture
Last updated: February 2026. RAG framework APIs evolve rapidly; verify specifics against current documentation.