Advanced RAG — Hybrid Search, Reranking & Knowledge Graphs (2026)
You have built a basic RAG system. It works in demos. It fails in production. Retrieval misses exact terms, returns noisy context, and breaks on multi-hop questions. This guide covers the techniques that separate demo-quality RAG from production-grade RAG: hybrid search, cross-encoder reranking, query transformation, and knowledge graph augmentation.
1. The Gap Between Basic and Advanced RAG
Basic RAG follows a simple pattern: embed the user query, run a vector similarity search, take the top-k chunks, inject them into the prompt. This works well on clean single-topic documents in controlled conditions. Production systems break those assumptions immediately.
Where Basic RAG Fails
Three failure modes account for most production RAG quality problems:
Retrieval recall failures. The right chunk exists in your corpus but is never retrieved. The user asks “How do I rotate API credentials?” and the relevant document uses the phrase “credential rotation procedure” — the semantic distance is small enough to look safe but large enough to drop the correct chunk out of the top-10 results. Multiply this across a heterogeneous enterprise corpus and recall becomes unreliable.
Retrieval precision failures. The top-k returned chunks are loosely related to the topic but do not actually contain the answer. The LLM receives noisy context, generates a vague or hallucinated response, and the user has no way to know the relevant document was not retrieved. This is the most insidious failure mode because the system appears to be working.
Multi-hop failures. The user asks a question whose answer requires combining facts from two or more documents — “What is the difference between how our Q3 and Q4 policies handle exception approvals?” No single chunk contains the answer, and a vector search returns chunks from both documents without the relational structure needed to connect them.
Each failure mode has a specific solution. Recall failures are solved by hybrid search and query transformation. Precision failures are solved by reranking. Multi-hop failures are solved by knowledge graph augmentation. Production RAG systems layer these techniques in sequence.
2. Beyond Basic RAG — The Retrieval Quality Stack
Before diving into individual techniques, it is useful to understand how they compose. Advanced RAG is not a single algorithm — it is a pipeline of improvements, each targeting a specific failure mode.
📊 Visual Explanation (diagram): Advanced RAG — Retrieval Quality Stack
Each layer targets a specific failure mode. Implement in order: hybrid search first, then reranking, then query transformation, then knowledge graph for multi-hop.
The stack reads bottom-up: start with basic vector search and add layers where your evaluation metrics show quality gaps. Most production systems need the first two layers (query transformation + hybrid search). The third (reranking) adds meaningful quality gains for corpora over 50,000 chunks. The fourth (knowledge graph) is warranted for enterprise knowledge bases where multi-hop questions are common.
3. Hybrid Search — Dense + Sparse Retrieval
Hybrid search combines dense vector retrieval with BM25 sparse keyword search, merging both ranked lists via Reciprocal Rank Fusion to recover exact-match terms that embedding models miss.
Why Dense-Only Retrieval Has a Blind Spot
Embedding models are trained to capture semantic meaning — the gist of what a sentence is about. They excel at handling synonyms, paraphrasing, and conceptual similarity. They are poor at distinguishing exact terms: a product SKU, a contract clause number, a person’s name, a specific error code.
When a user asks “What does error code E-4291 mean?”, a vector search returns chunks about error handling in general, not the specific chunk containing E-4291’s definition. Because the model never saw this code in training, the token “E-4291” carries almost no distinctive signal: the query embedding lands near generic phrases like “system error”, and nothing pulls it toward the one chunk that reads “E-4291: disk quota exceeded”.
Sparse retrieval — specifically BM25 (Best Matching 25), a classical information retrieval algorithm — solves this exactly. BM25 is a term-frequency algorithm that scores documents by the presence and frequency of exact query terms, adjusted for document length. It has no concept of semantics, but it is extremely precise for exact-match retrieval.
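As an illustration, the BM25 scoring function fits in a few lines of Python. The corpus, whitespace tokenization, and parameter defaults (k1=1.5, b=0.75) below are common conventions for a sketch, not tied to any specific library:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            numerator = tf[t] * (k1 + 1)
            denominator = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * numerator / denominator
        scores.append(score)
    return scores

docs = [
    "error E-4291 disk quota exceeded".split(),
    "general error handling guide".split(),
]
print(bm25_scores(["E-4291"], docs))  # only the first document scores above zero
```

Note the behavior a dense retriever cannot replicate: the document that literally contains “E-4291” gets a positive score, and every other document gets exactly zero.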
How Reciprocal Rank Fusion Works
Hybrid search runs both retrieval methods in parallel and merges results. The merging algorithm that consistently outperforms weighted averaging is Reciprocal Rank Fusion (RRF):
RRF_score(doc) = Σ 1 / (k + rank_in_list)

where k is a smoothing constant (typically 60) and the sum runs over each ranked list. A document ranked #1 in the vector search list and #3 in the BM25 list gets a higher combined score than a document that ranks #1 in one list but does not appear in the other. RRF is robust to score scale differences between dense and sparse retrievers — it only cares about rank position, not raw scores.
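The formula translates directly into code. A small sketch with illustrative document IDs:

```python
def rrf_merge(ranked_lists, k=60):
    """Merge ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # vector search ranking
sparse = ["doc_c", "doc_a", "doc_d"]  # BM25 ranking
print(rrf_merge([dense, sparse]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_a ranks #1 and #2 across the two lists and wins; doc_b and doc_d each appear in only one list and fall to the bottom despite their individual ranks.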
Hybrid Search Implementation
Most modern vector databases support hybrid search natively. Here is the pattern using Qdrant:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import SparseVector, FusionQuery, Fusion

client = QdrantClient(url="http://localhost:6333")

def hybrid_search(query: str, collection: str, top_k: int = 20) -> list:
    # Dense retrieval: embed the query
    dense_vector = embed(query)  # returns List[float]

    # Sparse retrieval: tokenize + compute BM25 weights
    sparse_vector = bm25_encode(query)  # returns {"indices": [...], "values": [...]}

    results = client.query_points(
        collection_name=collection,
        prefetch=[
            # Dense search
            {"query": dense_vector, "using": "dense", "limit": 50},
            # Sparse BM25 search
            {"query": SparseVector(**sparse_vector), "using": "sparse", "limit": 50},
        ],
        # Merge both result sets using Reciprocal Rank Fusion
        query=FusionQuery(fusion=Fusion.RRF),
        limit=top_k,
    )
    return results.points
```

The prefetch block runs both searches independently. FusionQuery(fusion=Fusion.RRF) applies RRF to merge them. The output is a single ranked list of up to top_k chunks, combining the best of semantic and keyword retrieval.
Weaviate and Pinecone offer equivalent APIs. Weaviate calls it hybrid search with an alpha parameter (0 = pure BM25, 1 = pure vector, 0.5 = balanced). For most production systems, alpha=0.75 (vector-dominant with BM25 correction) is a reasonable starting point to tune from.
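Conceptually, the alpha blend is a convex combination of the two retrievers’ scores. The sketch below shows the idea only — it is not Weaviate’s exact fusion implementation — and assumes both score sets are already normalized to [0, 1]:

```python
def alpha_fuse(vector_scores, bm25_scores, alpha=0.75):
    """Blend normalized dense and sparse scores.
    alpha=1 is pure vector search, alpha=0 is pure BM25."""
    doc_ids = set(vector_scores) | set(bm25_scores)
    fused = {
        d: alpha * vector_scores.get(d, 0.0) + (1 - alpha) * bm25_scores.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

vector_scores = {"doc_a": 0.9, "doc_b": 0.4}
bm25_scores = {"doc_b": 1.0, "doc_c": 0.7}
print(alpha_fuse(vector_scores, bm25_scores, alpha=0.75))
```

At alpha=0.75, doc_a’s strong dense score outweighs doc_b’s perfect BM25 score; at alpha=0.25 the ranking would flip, which is why alpha is a tuning knob rather than a fixed constant.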
When Hybrid Search Matters Most
Hybrid search delivers its largest gains on corpora with:
- Technical identifiers: error codes, product SKUs, contract numbers, ticket IDs
- Proper nouns: person names, company names, place names that may not appear in training data
- Domain-specific jargon: abbreviations and acronyms specific to your organization
- Mixed content types: documentation, emails, code comments, database records in the same corpus
For a clean single-topic corpus with rich semantic content and no exact-match requirements, pure vector search performs nearly as well as hybrid. In practice, enterprise knowledge bases almost always have mixed content, so hybrid search is the correct default.
4. The Advanced RAG Pipeline — Visual Overview
The diagram below shows how query transformation, hybrid retrieval, reranking, and grounded generation compose into a single end-to-end pipeline.
📊 Visual Explanation (diagram): Advanced RAG Pipeline — Query to Response
Query transformation runs first to improve what retrieval sees. Hybrid search broadens recall. Reranking sharpens precision. The LLM gets a high-signal context window.
5. Reranking — Cross-Encoder Precision
A cross-encoder reranker scores each (query, chunk) pair jointly, producing precise relevance scores that bi-encoder vector similarity cannot match — at the cost of 150–300ms additional latency.
Why Bi-Encoder Retrieval is an Approximation
Vector search uses bi-encoders: the query and each document are encoded independently into vectors, and similarity is measured by cosine distance or dot product. This is efficient because document vectors are precomputed once and stored. At query time, only the query needs to be embedded, and similarity is a fast vector operation.
The cost of this efficiency is precision. The bi-encoder never sees the query and document together. It cannot model their direct interaction — phrasing nuances, negations, or conditional relationships. The result: vector similarity scores are directional indicators of relevance, not precise relevance measurements.
Cross-encoders take both the query and the document as a single concatenated input. The model’s attention layers can compute direct relationships between any token in the query and any token in the document. This produces a precise relevance score but requires a separate model inference call for every (query, chunk) pair — which is why cross-encoders are only practical as a reranking step over a small candidate set, not as the primary retrieval mechanism.
Implementing Reranking
The most widely used reranking APIs in production are Cohere Rerank and Jina Reranker. Both accept a query and a list of candidate texts and return relevance scores:
```python
import cohere

co = cohere.Client("your-api-key")

def rerank_chunks(query: str, chunks: list[dict], top_n: int = 5) -> list[dict]:
    """
    chunks: list of {"text": str, "metadata": dict}
    Returns top_n chunks reranked by cross-encoder relevance score.
    """
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["text"] for c in chunks],
        top_n=top_n,
        return_documents=True,
    )

    reranked = []
    for result in response.results:
        chunk = chunks[result.index].copy()
        chunk["relevance_score"] = result.relevance_score
        reranked.append(chunk)

    return reranked

# Full pipeline: hybrid retrieve then rerank
raw_chunks = hybrid_search(query, collection="enterprise_docs", top_k=20)
final_chunks = rerank_chunks(query, raw_chunks, top_n=5)
# final_chunks are what you inject into the prompt
```

For self-hosted reranking (lower latency, no API cost), the cross-encoder/ms-marco-MiniLM-L-6-v2 model from sentence-transformers is a strong open-source option that fits on a single GPU:
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_local(query: str, chunks: list[str], top_n: int = 5) -> list[tuple[int, float]]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = model.predict(pairs)
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]
```

Latency Tradeoffs
Reranking 20 candidates with Cohere Rerank v3 adds approximately 150–300ms to the retrieval pipeline. For a real-time chat interface where total response latency is 2–4 seconds, this is acceptable. For high-throughput batch processing, it may not be. Options for latency-sensitive systems:
- Reduce the candidate set from 20 to 10 before reranking
- Use a lighter self-hosted cross-encoder model
- Cache reranking results for repeated queries (high hit rate for FAQ-style corpora)
- Move to a two-stage approach: fast approximate reranker for <10ms, full reranker only when confidence is low
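The caching option above is straightforward with functools.lru_cache. In the sketch below, the toy word-overlap scorer is a hypothetical stand-in for a real cross-encoder call; the cache key combines the query with the candidate texts, so a changed candidate set is a cache miss:

```python
from functools import lru_cache

def expensive_rerank(query: str, texts: tuple[str, ...]) -> tuple[int, ...]:
    """Stand-in for a real reranker call; replace with Cohere or a local cross-encoder.
    Returns candidate indices ranked by shared word count with the query (toy scoring)."""
    expensive_rerank.calls += 1
    q = set(query.lower().split())
    return tuple(sorted(range(len(texts)),
                        key=lambda i: -len(q & set(texts[i].lower().split()))))
expensive_rerank.calls = 0

@lru_cache(maxsize=10_000)
def cached_rerank(query: str, texts: tuple[str, ...]) -> tuple[int, ...]:
    return expensive_rerank(query, texts)

chunks = ("rotate api credentials monthly", "office seating chart")
cached_rerank("how do I rotate credentials", chunks)
cached_rerank("how do I rotate credentials", chunks)  # cache hit, no second model call
print(expensive_rerank.calls)  # → 1
```

For FAQ-style corpora where the same questions recur, this turns the reranking cost into a one-time expense per distinct (query, candidates) pair.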
6. Query Transformation Techniques
Query transformation addresses a fundamental asymmetry: users write questions, but knowledge bases contain answers. The linguistic structure of a question and the linguistic structure of the document that answers it are often very different. Query transformation bridges that gap before retrieval begins.
HyDE — Hypothetical Document Embeddings
Section titled “HyDE — Hypothetical Document Embeddings”HyDE inverts the query direction. Instead of searching the corpus with the user’s question, it asks the LLM to generate a hypothetical answer, then searches with that hypothetical answer’s embedding.
```python
from openai import OpenAI

client = OpenAI()

def hyde_query(question: str) -> str:
    """Generate a hypothetical answer to improve retrieval recall."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Generate a detailed, factual-sounding answer to the question below. "
                    "Write it as if it came from a technical documentation page. "
                    "Do not add disclaimers. If you are uncertain, write a plausible answer anyway."
                ),
            },
            {"role": "user", "content": question},
        ],
        max_tokens=300,
        temperature=0.3,
    )
    return response.choices[0].message.content

# HyDE retrieval: embed the hypothetical answer, not the question
hypothetical_answer = hyde_query("How does attention masking work in transformer decoders?")
results = vector_search(embed(hypothetical_answer), top_k=20)
```

Why it works: embeddings of answers cluster near other answers in the vector space. A question embedding sits at a different point. By generating a hypothetical answer and embedding that, you effectively query from the “answer cluster” region of the embedding space. This is especially effective for short or ambiguous questions where the question embedding carries little semantic signal.
The cost: one LLM call per query. The hypothetical answer can be generated in parallel with the baseline vector search, so it adds minimal end-to-end latency if parallelized.
Step-Back Prompting
Step-back prompting reformulates a specific question into a broader principle or concept before retrieval. A question about a specific implementation detail may fail to retrieve the conceptual documentation that explains it.
```python
def step_back_query(specific_question: str) -> str:
    """Reformulate a specific question into a more general principle."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rephrase the specific question below into a more general, "
                    "conceptual question that would retrieve background principles. "
                    "Return only the reformulated question, nothing else."
                ),
            },
            {"role": "user", "content": specific_question},
        ],
        max_tokens=100,
        temperature=0.2,
    )
    return response.choices[0].message.content

# Example:
# Input:  "Why does my LangChain ReAct agent loop 5 times before stopping?"
# Output: "How do ReAct agents determine when to stop the reasoning loop?"
```

Step-back prompting works best when users ask highly specific questions about implementation details and the corpus contains conceptual documentation that explains the principles. It is most useful for developer-facing documentation systems.
Multi-Query Retrieval
Multi-query expansion generates several alternative phrasings of the same question and runs retrieval for each. Results are merged and deduplicated by chunk ID before reranking. This is one of the highest-leverage recall improvements available:
```python
import concurrent.futures

def generate_query_variants(question: str, n: int = 4) -> list[str]:
    """Generate n alternative phrasings of the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    f"Generate {n} different phrasings of the question below. "
                    "Use different vocabulary and sentence structures. "
                    "Return one phrasing per line, no numbering or bullets."
                ),
            },
            {"role": "user", "content": question},
        ],
        max_tokens=300,
        temperature=0.7,
    )
    variants = response.choices[0].message.content.strip().split("\n")
    return [question] + [v.strip() for v in variants if v.strip()]

def multi_query_retrieve(question: str, top_k_per_query: int = 10) -> list[dict]:
    """Retrieve with multiple query variants, deduplicate by chunk ID."""
    queries = generate_query_variants(question, n=4)

    # Run all retrievals in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:
        result_sets = list(executor.map(
            lambda q: hybrid_search(q, collection="docs", top_k=top_k_per_query),
            queries,
        ))

    # Deduplicate by chunk ID, keeping the first occurrence of each chunk
    seen_ids = {}
    for results in result_sets:
        for chunk in results:
            if chunk.id not in seen_ids:
                seen_ids[chunk.id] = chunk

    return list(seen_ids.values())
```

Multi-query retrieval is parallelizable — all query variants run simultaneously — so the latency cost is one retrieval call’s latency, not N. The main cost is the LLM call to generate variants, typically <500ms for a fast model.
7. Knowledge Graph RAG
Knowledge graph RAG augments vector retrieval with entity-relationship traversal, enabling multi-hop questions that require connecting facts across multiple documents.
The Multi-Hop Problem
Vector retrieval is fundamentally document-centric: it finds chunks that are similar to the query. This works for single-hop questions where the answer lives in one place. It fails for multi-hop questions where the answer requires connecting facts across documents.
“What team owns the service that handles payment processing, and what is their on-call rotation?” requires two pieces of information: which service handles payments (from a service catalog document) and who owns that service’s on-call rotation (from a team roster document). No single chunk contains both. A vector search returns chunks about payments and chunks about on-call rotations, but cannot tell the LLM which team to look up.
A knowledge graph makes the connection explicit. The graph stores entities (services, teams, people) as nodes and relationships (owns, escalates-to, depends-on) as edges. The multi-hop query becomes a graph traversal: payment service → owned-by → team A → on-call-rotation → [names].
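That traversal can be sketched with a plain in-memory adjacency list. The entity IDs, relationship names, and people below are illustrative, not drawn from any real catalog:

```python
# Adjacency list: node -> list of (relationship, neighbor)
graph = {
    "service:payments": [("owned-by", "team:platform")],
    "team:platform": [("on-call-rotation", "rotation:platform-oncall")],
    "rotation:platform-oncall": [("member", "alice"), ("member", "bob")],
}

def traverse(start: str, path: list[str]) -> list[str]:
    """Follow a chain of relationship types from a start node; return the end nodes."""
    frontier = [start]
    for rel in path:
        frontier = [
            neighbor
            for node in frontier
            for r, neighbor in graph.get(node, [])
            if r == rel
        ]
    return frontier

print(traverse("service:payments", ["owned-by", "on-call-rotation", "member"]))
# → ['alice', 'bob']
```

Each hop is an explicit edge lookup, so the answer to “who is on call for payments?” falls out deterministically, with no reliance on two unrelated chunks happening to co-occur in a similarity search.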
How GraphRAG Works
Microsoft’s GraphRAG (released 2024) introduced the pattern that most production implementations now follow:
- Entity extraction: Run the full document corpus through an LLM to extract named entities and their relationships. Store these as a knowledge graph.
- Community summarization: Cluster related entities into communities. Summarize each community using an LLM. Store these summaries.
- Hybrid query routing: For a user query, determine whether it requires local retrieval (specific facts → vector search) or global synthesis (relationships and patterns → graph traversal + community summaries).
- Graph-augmented generation: Combine the retrieved chunks from vector search with the relevant subgraph from the knowledge graph and the relevant community summaries. Pass all three to the LLM.
This approach is particularly powerful for questions like “What are the main themes across our Q3 customer feedback?” or “How are our authentication services interconnected?” — questions that require a global view of the corpus rather than local fact retrieval.
Lightweight KG Integration
Full GraphRAG implementation is complex. For teams that need multi-hop support without the full graph pipeline, a lighter-weight approach uses metadata relationships explicitly:
```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def graph_augment_query(query: str, initial_chunks: list[dict]) -> list[dict]:
    """
    Given initially retrieved chunks, traverse the knowledge graph to find
    related entities and retrieve additional connected chunks.
    """
    # Extract entity references from initial chunks
    entity_ids = extract_entity_ids(initial_chunks)  # e.g., ["service:auth-api", "team:platform"]

    with driver.session() as session:
        # Find 1-hop neighbors for each entity
        result = session.run(
            """
            MATCH (e)-[r]-(neighbor)
            WHERE e.id IN $entity_ids
            RETURN e.id AS source, type(r) AS rel,
                   neighbor.id AS target, neighbor.chunk_ids AS chunk_ids
            LIMIT 50
            """,
            entity_ids=entity_ids,
        )
        neighbor_chunk_ids = []
        for record in result:
            if record["chunk_ids"]:
                neighbor_chunk_ids.extend(record["chunk_ids"])

    # Retrieve the neighbor chunks from the vector store
    graph_chunks = fetch_chunks_by_id(neighbor_chunk_ids)

    # Return initial chunks + graph-connected chunks, deduplicated
    all_chunk_ids = {c["id"] for c in initial_chunks}
    additional = [c for c in graph_chunks if c["id"] not in all_chunk_ids]
    return initial_chunks + additional
```

This lightweight approach works well when your entity graph is predefined (service catalog, org chart, product taxonomy) rather than extracted dynamically from the corpus.
8. Production Patterns and Context Window Management
Advanced retrieval improves the quality of what you retrieve. But with multiple query variants, hybrid search, reranking, and graph augmentation all running, you need careful management of what actually goes into the LLM’s context window.
Context Assembly Strategy
After reranking, you typically have 3–5 high-quality chunks. Before passing them to the LLM:
Order matters. Position the highest-relevance chunks first and last. LLMs have documented primacy and recency biases (attending more to the beginning and end of the context). The “lost-in-the-middle” problem means the middle positions receive less attention. With 5 chunks, place ranks 1 and 2 at positions 1 and 5, not positions 2 and 3.
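A minimal sketch of that placement rule, assuming the chunks arrive best-first from the reranker:

```python
def order_for_context(chunks: list) -> list:
    """Interleave reranked chunks so the best land at the edges of the context:
    odd-ranked chunks fill from the front, even-ranked from the back."""
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks as placeholder values: rank 1 ends at position 1, rank 2 at position 5
print(order_for_context([1, 2, 3, 4, 5]))  # → [1, 3, 5, 4, 2]
```

The weakest chunks end up in the middle positions, where attention loss costs the least.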
Apply token budgeting. Reserve a fixed token budget for retrieved context (e.g., 4,000 tokens out of a 16,000-token context window). If chunks exceed the budget after reranking, truncate the lowest-ranked chunk first. Never truncate the highest-ranked chunk — it is there because it scored highest.
Include source metadata in context. Format each chunk with its source reference. This enables the LLM to produce citations and allows users to verify answers:
```python
def assemble_rag_prompt(query: str, chunks: list[dict]) -> str:
    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk["metadata"].get("source_url", "internal document")
        context_parts.append(f"[{i}] Source: {source}\n{chunk['text']}")

    context_block = "\n\n---\n\n".join(context_parts)

    return f"""You are a helpful assistant. Answer the question using ONLY the context provided below.
If the context does not contain sufficient information, say "I don't have enough information to answer this."
Always cite your sources by reference number [1], [2], etc.

CONTEXT:
{context_block}

QUESTION: {query}

ANSWER:"""
```

Latency Budget for the Full Pipeline
A production advanced RAG pipeline has multiple serial and parallel steps. Understanding the latency breakdown is essential for meeting SLA targets:
| Step | Typical Latency | Parallelizable? |
|---|---|---|
| Query transformation (LLM call) | 300–600ms | Yes (parallel with dense search) |
| Dense vector search | 20–80ms | Yes (parallel with sparse) |
| Sparse BM25 search | 5–20ms | Yes (parallel with dense) |
| RRF fusion | <1ms | No |
| Reranking (20 candidates, API) | 150–300ms | No |
| LLM generation (streaming) | 1–3s (first token) | No |
With parallelization of the query transform and hybrid search steps, end-to-end retrieval (excluding LLM generation) runs in approximately 400–700ms. This fits comfortably within a 3-second total response time with streaming.
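The overlap can be sketched with a thread pool. The function names and sleep durations below are stand-ins for real calls, not measurements:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def transform_query(q: str) -> str:
    """Stand-in for the LLM query rewrite (~400ms)."""
    time.sleep(0.4)
    return q + " (rewritten)"

def retrieve(q: str) -> list[str]:
    """Stand-in for dense + sparse search plus RRF fusion (~100ms)."""
    time.sleep(0.1)
    return [f"chunk for: {q}"]

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    rewrite = pool.submit(transform_query, "rotate credentials")
    baseline = pool.submit(retrieve, "rotate credentials")
    # Once the rewrite is ready, retrieve for it too
    expanded = retrieve(rewrite.result())
candidates = baseline.result() + expanded
elapsed = time.perf_counter() - start
print(f"retrieval finished in {elapsed:.2f}s with {len(candidates)} candidates")
```

The baseline search runs entirely inside the rewrite’s shadow, so total retrieval time is roughly the rewrite plus one search (~0.5s here), not the serial sum of all three steps.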
For latency-critical applications (real-time voice interfaces, sub-500ms total latency requirements), skip query transformation, use a lightweight self-hosted reranker, and limit candidates to 10 rather than 20.
9. Interview Preparation
Advanced RAG is a high-frequency topic in senior GenAI engineering interviews. The questions below test depth beyond basic RAG knowledge.
Q1: Why is hybrid search better than pure vector search for enterprise corpora?
The complete answer covers three points: (1) dense retrieval misses exact-match terms like identifiers, names, and domain jargon because embedding models compress semantics and lose token-level precision; (2) BM25 handles exact matches but misses semantic synonyms; (3) RRF fusion combines ranked lists without needing score normalization, making it robust to the different score scales of the two systems. Bonus points for explaining when hybrid search does not add value (clean single-topic corpus, no exact-match requirements).
Q2: When would you add reranking and when would you skip it?
Reranking adds value when: the corpus is large enough that the initial retrieval candidates include noisy results (50,000+ chunks), precision matters more than marginal additional latency, and the query distribution is varied. You might skip reranking for: real-time applications with strict latency budgets under 500ms total, small curated corpora where vector similarity is already highly precise, or batch pipelines where all retrieved context is passed to the LLM regardless.
Q3: What is HyDE and why does it improve retrieval?
HyDE addresses the question-answer asymmetry in embedding space: question embeddings and answer embeddings occupy different regions of the vector space. By generating a hypothetical answer and embedding that instead of the question, you query from the answer region. This consistently improves recall for short or ambiguous questions. The tradeoff is one LLM call per query (parallelizable) and sensitivity to hallucination quality in the hypothetical answer — a very wrong hypothetical answer will retrieve from the wrong region.
Q4: How would you add knowledge graph capabilities to an existing RAG system without a full GraphRAG rebuild?
Practical answer: (1) Build a lightweight entity graph from predefined structured data (service catalog, org chart, product taxonomy) rather than LLM-extracted entities — this is faster and more reliable; (2) add entity ID metadata to chunks during indexing; (3) post-retrieval, traverse one hop from entities found in the initial results to fetch adjacent chunks; (4) add these graph-adjacent chunks to the candidate set before reranking. This approach gives multi-hop capability for structured relationships without the complexity of full graph extraction.
10. Summary and Key Takeaways
Advanced RAG is not a single technique — it is a set of targeted improvements, each addressing a specific failure mode in basic retrieval. The right combination depends on your corpus characteristics, quality requirements, and latency constraints.
Decision Framework
| Failure Mode | Symptom | Solution |
|---|---|---|
| Recall: exact terms missed | Retrieval misses product codes, names, IDs | Hybrid search (BM25 + dense) |
| Recall: query-document mismatch | Right document exists but not retrieved | Query transformation (HyDE or multi-query) |
| Precision: noisy top-k | LLM gets loosely related context | Cross-encoder reranking |
| Multi-hop: single chunk insufficient | Multi-step questions return incomplete answers | Knowledge graph augmentation |
| Context: LLM ignores middle chunks | Answers miss key facts in context | Position-aware context assembly |
Implementation Sequence
Start with hybrid search — it requires no additional model calls and delivers the largest recall improvement per unit of complexity. Add reranking next for corpora over 50,000 chunks. Implement query transformation when you have evaluation data showing recall gaps on specific question types. Add knowledge graph support when multi-hop questions are a recurring user need.
Measure impact at each step using RAGAS metrics — context recall and context precision are the metrics most sensitive to retrieval improvements. Do not add complexity without validating that the metrics move in the right direction on your actual query distribution.
The full advanced RAG pipeline represents a significant engineering investment. Build it incrementally, gate each addition on measured improvement, and keep the architecture observable so you can see exactly where in the pipeline quality is lost.
Related Pages
- RAG Architecture — Production RAG Systems — Foundation patterns: chunking, basic retrieval, and the two-pipeline architecture
- Embeddings for Engineers — How embedding models work and how to choose between them
- Vector Database Comparison — Pinecone, Weaviate, Qdrant, and Chroma for hybrid search support
- LLM Evaluation Guide — RAGAS metrics for measuring retrieval quality improvements
- Context Windows and Token Management — How to manage context assembly for large retrieval result sets
Last updated: March 2026. API interfaces for Qdrant, Cohere Rerank, and LangChain evolve regularly — verify against current documentation before implementing.
Frequently Asked Questions
What is hybrid search in RAG?
Hybrid search in RAG combines dense vector retrieval (semantic similarity via embeddings) with sparse keyword retrieval (BM25 or TF-IDF). The two result sets are merged using Reciprocal Rank Fusion (RRF). Dense search handles paraphrasing and synonyms; sparse search handles exact matches like product codes, names, and technical identifiers. Most production RAG systems use hybrid search because it significantly improves recall over either method alone.
What is reranking in RAG?
Reranking is a second-pass relevance scoring step applied after initial retrieval. A cross-encoder model reads both the user query and each retrieved chunk as a single input, producing a precise relevance score for each pair. The top 3-5 highest-scoring chunks are passed to the LLM. Reranking improves precision at the cost of additional latency, typically 150-300ms for 20 candidates.
What is knowledge graph RAG?
Knowledge graph RAG (also called GraphRAG) augments vector retrieval with a structured knowledge graph that represents entities and their relationships. When a user asks a multi-hop question requiring connecting two or more related facts, pure vector search often fails because no single chunk contains the full answer. The knowledge graph traversal finds the connecting path between entities, and those connected facts are included in the LLM's context alongside traditional retrieved chunks.
What is query transformation in RAG?
Query transformation rewrites or expands the user's original query before retrieval to improve recall. Three main techniques exist: HyDE generates a hypothetical answer and embeds that instead of the raw question; step-back prompting reformulates specific questions into broader principles; multi-query generation produces 3-5 alternative phrasings and retrieves for each, then merges and deduplicates results before reranking.
What is HyDE and how does it improve RAG retrieval?
HyDE (Hypothetical Document Embeddings) addresses the question-answer asymmetry in embedding space. Instead of searching the corpus with the user's question, HyDE asks the LLM to generate a hypothetical answer, then embeds that answer for retrieval. This works because embeddings of answers cluster near other answers in vector space, while question embeddings sit at a different point. The cost is one additional LLM call per query, which can be parallelized with the baseline vector search.
What is Reciprocal Rank Fusion in RAG?
Reciprocal Rank Fusion (RRF) is the merging algorithm used in hybrid search to combine results from dense vector search and sparse BM25 search. The formula scores each document as the sum of 1/(k + rank) across each ranked list, where k is a smoothing constant (typically 60). RRF is robust to score scale differences between dense and sparse retrievers because it only considers rank position, not raw scores.
What is the difference between basic RAG and advanced RAG?
Basic RAG embeds the user query, runs a vector similarity search, takes the top-k chunks, and injects them into the prompt. Advanced RAG adds multiple layers to fix production failures: hybrid search to catch exact-match terms that pure vector search misses, cross-encoder reranking to eliminate noisy results, query transformation to bridge the question-answer mismatch, and knowledge graph traversal to handle multi-hop questions requiring facts from multiple documents.
How much latency does an advanced RAG pipeline add?
With parallelization of query transformation and hybrid search steps, end-to-end retrieval excluding LLM generation runs in approximately 400-700ms. Query transformation adds 300-600ms but runs in parallel with dense search. Dense vector search takes 20-80ms, sparse BM25 search takes 5-20ms, and cross-encoder reranking of 20 candidates adds 150-300ms. This fits within a 3-second total response time with streaming.
When should I add reranking to my RAG pipeline?
Add reranking when your corpus exceeds 50,000 chunks and the initial retrieval candidates include noisy results. Reranking adds the most value when precision matters more than marginal additional latency and when the query distribution is varied. Skip reranking for real-time applications with strict latency budgets under 500ms, small curated corpora where vector similarity is already precise, or batch pipelines where all retrieved context is passed to the LLM.
What is the lost-in-the-middle problem in RAG?
The lost-in-the-middle problem refers to LLMs attending more to the beginning and end of their context window while giving less attention to chunks in the middle. To mitigate this in RAG, position the highest-relevance chunks at the first and last positions in the context. With 5 reranked chunks, place ranks 1 and 2 at positions 1 and 5 rather than positions 2 and 3.