RAG Pipeline Tutorial — Build a Working RAG System in Python (2026)

This tutorial walks you through building a complete RAG pipeline in Python — from loading raw documents to generating grounded answers. Every code block runs. By the end, you will have a working system that answers questions about your own documents.

Prerequisites: Python 3.10+, an OpenAI API key, and basic familiarity with Python. No prior RAG experience required.

Frameworks like LangChain and LlamaIndex abstract the RAG pipeline into a few function calls. That abstraction is useful in production — but it hides the mechanics. When retrieval quality is poor, when answers hallucinate, when latency is unacceptable, you need to understand what is happening at each stage to diagnose and fix the problem.

Building a RAG pipeline from scratch teaches you:

  • What each component does — not conceptually, but in working code you can step through
  • Where quality bottlenecks live — chunking decisions, embedding choices, retrieval parameters
  • How to debug retrieval failures — why a relevant document was not retrieved, why an irrelevant one was
  • What frameworks abstract away — so you know when the abstraction helps and when it hinders

This is not an exercise in reinventing the wheel. It is an exercise in understanding the wheel so you can build better vehicles. After completing this tutorial, framework-based RAG code will make significantly more sense because you will have implemented every layer yourself.

The entire pipeline fits in roughly 200 lines of Python. No framework magic. No hidden complexity.


The RAG pipeline you build in this tutorial has six stages. Each stage is a discrete function you can inspect, modify, and test independently.

Pipeline overview:

1. Load Documents → Read PDF or text files into raw strings
2. Chunk Text → Split documents into overlapping segments (500 tokens, 50-token overlap)
3. Generate Embeddings → Convert each chunk into a vector using OpenAI's embedding model
4. Store in Vector DB → Index vectors in ChromaDB for fast similarity search
5. Retrieve Chunks → Given a user query, find the most relevant chunks
6. Generate Answer → Pass retrieved chunks + query to an LLM for a grounded response

What you will need:

pip install openai chromadb PyPDF2 tiktoken
  • openai — embedding generation and LLM calls
  • chromadb — local vector database (no server required)
  • PyPDF2 — PDF document loading
  • tiktoken — accurate token counting for chunk sizing

Set your API key as an environment variable before running any code:

export OPENAI_API_KEY="your-api-key-here"

The total cost to run this tutorial is under $0.05 in OpenAI API calls. Embedding a few documents and making a handful of LLM calls uses minimal tokens.


Before writing code, understand the data flow. A RAG pipeline has two phases: an offline indexing phase (steps 1–4, run once per document set) and an online query phase (steps 5–6, run on every user question).

RAG Pipeline: End-to-End Data Flow

The indexing phase runs once per document set. The query phase runs on every user question. Both connect through the vector store.

  • Indexing phase (run once when documents change): Load Documents → Chunk Text → Generate Embeddings → Store in Vector DB
  • Query phase (run on every user question): Embed User Query → Similarity Search → Retrieve Top-K Chunks → Assemble Context
  • Generation phase (LLM produces grounded answer): Build Prompt → Inject Retrieved Context → LLM Generation → Return Answer

The indexing phase transforms raw documents into searchable vectors. The query phase converts the user’s question into a vector, finds the closest document chunks, and passes them as context to the LLM. The LLM generates an answer grounded in the retrieved content — not from its training data alone.


This is the core of the tutorial. Each step is a standalone function with a clear input and output. Together, they form the complete pipeline.

Load text from PDF files or plain text files. This function handles both formats and returns a list of dicts, each holding a document's text and its source path.

from pathlib import Path

from PyPDF2 import PdfReader


def load_documents(file_paths: list[str]) -> list[dict]:
    """Load documents from PDF or text files.

    Returns a list of dicts with 'text' and 'source' keys.
    """
    documents = []
    for path in file_paths:
        file_path = Path(path)
        suffix = file_path.suffix.lower()  # Treat ".PDF" the same as ".pdf"
        if suffix == ".pdf":
            reader = PdfReader(str(file_path))
            # extract_text() may return nothing for image-only pages
            text = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
        elif suffix in (".txt", ".md"):
            text = file_path.read_text(encoding="utf-8")
        else:
            print(f"Skipping unsupported file: {path}")
            continue
        # Skip documents that yielded no text at all
        if text.strip():
            documents.append({"text": text, "source": str(file_path)})
    return documents

Key decisions:

  • Each document carries its source metadata — this enables citation in the final answer
  • Pages that yield no text contribute an empty string, and a document with no extractable text at all is skipped (common in scanned PDFs without a usable OCR layer)
  • The function is format-agnostic — add more file types by extending the if/elif chain

Chunking splits documents into segments small enough for embedding but large enough to preserve context. Overlap reduces the chance that information is lost at chunk boundaries.

import tiktoken


def chunk_text(
    text: str,
    source: str,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[dict]:
    """Split text into overlapping chunks based on token count.

    Returns a list of dicts with 'text', 'source', and 'chunk_index' keys.
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    encoder = tiktoken.encoding_for_model("gpt-4o-mini")
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append({
            "text": encoder.decode(chunk_tokens),
            "source": source,
            "chunk_index": len(chunks),
        })
        if end >= len(tokens):
            break  # Last chunk reached; avoid emitting an overlap-only tail
        # Move forward by chunk_size minus overlap
        start += chunk_size - overlap
    return chunks


def chunk_documents(documents: list[dict], **kwargs) -> list[dict]:
    """Chunk all documents and return a flat list of chunks."""
    all_chunks = []
    for doc in documents:
        doc_chunks = chunk_text(doc["text"], doc["source"], **kwargs)
        all_chunks.extend(doc_chunks)
    return all_chunks

Why 500 tokens with 50-token overlap?

  • 500 tokens is large enough to contain a complete paragraph or concept, but small enough that the embedding vector captures a focused topic. Larger chunks dilute the relevance signal. Smaller chunks lose context.
  • 50-token overlap means that a sentence split at a chunk boundary usually appears in full in at least one of the two chunks, provided it is shorter than the overlap. Without overlap, information that straddles a boundary is fragmented across chunks and is harder to retrieve.

These defaults work well for most documents. Adjust them based on your specific content — shorter for FAQ-style documents, longer for dense technical writing.
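To make the window arithmetic concrete, here is a minimal sketch that computes only the token positions each chunk would cover, without calling a tokenizer. The helper name `window_bounds` is hypothetical and is not part of the tutorial's pipeline; it follows the same sliding-window idea and stops once the end of the document is reached.

```python
def window_bounds(
    n_tokens: int,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[tuple[int, int]]:
    """Return the (start, end) token positions covered by each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    bounds = []
    start = 0
    step = chunk_size - overlap  # each window advances by this many tokens
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # final window reached
        start += step
    return bounds


# A hypothetical 1,200-token document with the tutorial's defaults
bounds = window_bounds(1200)
print(bounds)
```

Each window starts 450 tokens after the previous one, so consecutive chunks share 50 tokens. Changing `chunk_size` or `overlap` here is a cheap way to see how many chunks a document of a given length would produce.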

Convert each text chunk into a numerical vector using OpenAI’s embedding model. These vectors capture semantic meaning — chunks about similar topics produce similar vectors.

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var


def generate_embeddings(
    chunks: list[dict],
    model: str = "text-embedding-3-small",
) -> list[dict]:
    """Add an 'embedding' key to each chunk dict.

    Sends chunks in batches to the OpenAI embedding API.
    """
    batch_size = 100  # API supports up to 2048 inputs per call
    texts = [chunk["text"] for chunk in chunks]
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        response = client.embeddings.create(
            model=model,
            input=batch,
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    # Attach embeddings to chunk dicts
    for chunk, embedding in zip(chunks, all_embeddings):
        chunk["embedding"] = embedding
    return chunks

Why text-embedding-3-small?

  • It produces 1536-dimensional vectors with strong semantic quality
  • It costs $0.02 per million tokens — embedding an entire book costs pennies
  • For higher quality at roughly 6.5x the list price ($0.13 per million tokens), use text-embedding-3-large (3072 dimensions)
  • For zero cost, use a local model like all-MiniLM-L6-v2 via sentence-transformers (384 dimensions, runs on CPU)
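As a rough check on the cost claim, the arithmetic can be sketched as follows. The corpus size is a made-up example, and the price is the $0.02 per million tokens quoted above; actual pricing may differ, so treat the constant as an assumption.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the text-embedding-3-small rate quoted above


def embedding_cost(n_chunks: int, tokens_per_chunk: int = 500) -> float:
    """Estimate embedding cost in USD for n_chunks chunks of a fixed size."""
    total_tokens = n_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


# Hypothetical corpus: 200 chunks of 500 tokens each = 100,000 tokens
cost = embedding_cost(200)
print(f"${cost:.4f}")
```

At this rate, 100,000 tokens cost a fraction of a cent, which is why embedding is rarely the expensive part of a small pipeline.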

ChromaDB is an open-source vector database that runs locally — no server setup, no cloud account. It stores vectors alongside metadata and supports fast similarity search.

import chromadb


def create_vector_store(
    chunks: list[dict],
    collection_name: str = "rag_tutorial",
) -> chromadb.Collection:
    """Store chunks with embeddings in a ChromaDB collection.

    Returns the collection for later querying.
    """
    chroma_client = chromadb.Client()  # In-memory; use PersistentClient for disk
    # Delete existing collection if it exists (clean re-index).
    # The exception raised for a missing collection differs across
    # ChromaDB versions, so catch broadly here.
    try:
        chroma_client.delete_collection(collection_name)
    except Exception:
        pass
    collection = chroma_client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # Cosine similarity
    )
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        embeddings=[chunk["embedding"] for chunk in chunks],
        documents=[chunk["text"] for chunk in chunks],
        metadatas=[
            {"source": chunk["source"], "chunk_index": chunk["chunk_index"]}
            for chunk in chunks
        ],
    )
    return collection

Key configuration:

  • hnsw:space: "cosine" — cosine similarity is the standard distance metric for OpenAI embeddings. Documents with similar meaning produce vectors that point in similar directions.
  • chromadb.Client() — in-memory storage. For persistence across sessions, use chromadb.PersistentClient(path="./chroma_db").
  • Metadata is stored alongside vectors — this enables filtering by source, date, or any custom field during retrieval.
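To illustrate what "vectors that point in similar directions" means numerically, here is a minimal pure-Python sketch of cosine similarity and the corresponding distance. The toy 3-dimensional vectors are invented for illustration, and the assumption that the vector store reports distance as 1 minus similarity should be checked against the ChromaDB documentation for your version.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance form used for ranking: 0.0 means identical direction."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors standing in for real 1536-dimensional embeddings
query = [1.0, 0.0, 1.0]
close_chunk = [2.0, 0.0, 2.0]  # same direction, different length
far_chunk = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_distance(query, close_chunk))
print(cosine_distance(query, far_chunk))
```

The first pair has a distance of approximately zero even though the vectors differ in length, because cosine measures direction only. The orthogonal pair has a distance of 1.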

Given a user query, embed it with the same model and find the closest chunks by cosine similarity.

def retrieve(
    query: str,
    collection: chromadb.Collection,
    n_results: int = 5,
) -> list[dict]:
    """Retrieve the most relevant chunks for a query.

    Returns chunks ranked by cosine similarity to the query embedding.
    """
    # Embed the query using the same model
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    query_embedding = query_response.data[0].embedding
    # Search the vector store
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )
    # Format results
    retrieved_chunks = []
    for i in range(len(results["documents"][0])):
        retrieved_chunks.append({
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i],
        })
    return retrieved_chunks

Why n_results=5?

  • Fewer results (1–2) risk missing relevant context. More results (10+) add noise that can confuse the LLM.
  • 5 is a practical default. After building the pipeline, tune this number based on your specific documents and queries.
  • The distance field is the cosine distance between the query and the chunk (lower means more similar), which is useful for debugging and setting relevance thresholds.
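A relevance threshold on the distance field could look like the sketch below. The function name `filter_by_distance` and the 0.5 cutoff are illustrative assumptions, not values tuned for any particular corpus; the input dicts mimic the shape that retrieve() returns.

```python
def filter_by_distance(chunks: list[dict], max_distance: float = 0.5) -> list[dict]:
    """Keep only chunks whose cosine distance is at or below max_distance."""
    return [c for c in chunks if c["distance"] <= max_distance]


# Hand-written results in the same shape that retrieve() returns
retrieved = [
    {"text": "Refund policy: 30 days.", "source": "policy.txt", "distance": 0.21},
    {"text": "Shipping takes 5 days.", "source": "faq.txt", "distance": 0.48},
    {"text": "Company history.", "source": "about.txt", "distance": 0.83},
]

kept = filter_by_distance(retrieved)
print([c["source"] for c in kept])
```

If the filter returns an empty list, that is a useful signal in itself: the pipeline can answer "I could not find this in the documents" instead of passing weak context to the LLM.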

Assemble the retrieved chunks into a prompt and send it to the LLM. The system prompt instructs the model to answer only from the provided context.

def generate_answer(
    query: str,
    retrieved_chunks: list[dict],
    model: str = "gpt-4o-mini",
) -> str:
    """Generate an answer grounded in the retrieved context.

    The LLM is instructed to use only the provided context.
    """
    # Build context from retrieved chunks
    context_parts = []
    for i, chunk in enumerate(retrieved_chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['source']}]\n{chunk['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)
    system_prompt = (
        "You are a helpful assistant that answers questions based on "
        "the provided context. Use only the information in the context "
        "to answer. If the context does not contain enough information "
        "to answer the question, say so explicitly. Cite the source "
        "number (e.g., [Source 1]) when using information from a "
        "specific passage."
    )
    user_prompt = (
        f"Context:\n{context}\n\n---\n\n"
        f"Question: {query}\n\n"
        "Answer based on the context above:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

Key prompt engineering decisions:

  • temperature=0: near-deterministic output. For RAG, you want the LLM to faithfully report what is in the context, not creatively interpret it.
  • Source citation instruction: the LLM is told to cite [Source N], linking answers back to specific retrieved chunks. This is basic grounding.
  • Explicit refusal instruction: “if the context does not contain enough information, say so” reduces the risk of the LLM hallucinating when retrieval fails.

Here is the function that ties the stages together. Place it in the same file as the functions defined above to get a single runnable script:

def run_rag_pipeline(
    file_paths: list[str],
    query: str,
) -> str:
    """Run the complete RAG pipeline end to end."""
    # Phase 1: Indexing
    print("Loading documents...")
    documents = load_documents(file_paths)
    print(f"Loaded {len(documents)} documents")
    print("Chunking...")
    chunks = chunk_documents(documents)
    print(f"Created {len(chunks)} chunks")
    print("Generating embeddings...")
    chunks = generate_embeddings(chunks)
    print("Embeddings generated")
    print("Storing in vector database...")
    collection = create_vector_store(chunks)
    print(f"Stored {collection.count()} chunks in ChromaDB")
    # Phase 2: Query
    print(f"\nQuery: {query}")
    print("Retrieving relevant chunks...")
    retrieved = retrieve(query, collection)
    print(f"Retrieved {len(retrieved)} chunks")
    print("Generating answer...")
    answer = generate_answer(query, retrieved)
    return answer


# Run it
if __name__ == "__main__":
    answer = run_rag_pipeline(
        file_paths=["your-document.pdf"],
        query="What are the main points of this document?",
    )
    print(f"\nAnswer:\n{answer}")

Run this against any PDF or text file on your machine. Replace the file path and query with your own. The output includes the generated answer with source citations.


Each layer of the RAG pipeline has a specific responsibility. Understanding this stack helps you swap components — replace ChromaDB with Pinecone, replace OpenAI embeddings with a local model, or add a reranking layer.

RAG Pipeline Component Stack

Each layer can be swapped independently. The interfaces between layers stay the same. From top to bottom:

  • LLM Generation Layer (GPT-4o-mini, Claude, Llama 3): produces grounded answers from context
  • Prompt Assembly Layer (system prompt + retrieved chunks + user query): controls LLM behavior
  • Retrieval Layer (vector similarity search): finds the most relevant chunks for a query
  • Vector Storage Layer (ChromaDB, Pinecone, Qdrant): indexes and serves embedding vectors
  • Embedding Layer (text-embedding-3-small, all-MiniLM-L6-v2): converts text to vectors
  • Document Processing Layer (loading, chunking, metadata extraction): prepares raw documents

The stack diagram shows why RAG is modular. You can upgrade any layer without rewriting the others. Want better retrieval? Add a reranking layer between retrieval and prompt assembly. Want lower cost? Swap OpenAI embeddings for a local model. Want persistence? Replace the in-memory ChromaDB client with a persistent one.


The basic pipeline works, but production RAG systems add three enhancements that measurably improve answer quality.

Vector search finds semantically similar content but misses exact keyword matches: product names, error codes, version numbers. Hybrid search combines vector similarity with keyword matching.

Approach: Run vector search to get candidates, then re-score by combining cosine similarity (70% weight) with keyword overlap (30% weight). ChromaDB supports where_document filters for basic keyword matching alongside vector search. For full BM25 keyword scoring, add Elasticsearch or use Weaviate which supports both natively.

# Core idea — combine scores from both retrieval methods
combined_score = 0.7 * vector_similarity + 0.3 * keyword_overlap
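A fuller version of this re-scoring idea might look like the sketch below. The helper names `keyword_overlap` and `hybrid_rescore` are hypothetical, keyword overlap is computed as a simple fraction of shared terms rather than BM25, and the candidate chunks and distances are invented for illustration.

```python
import re


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query terms that also appear in the chunk text."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    text_terms = set(re.findall(r"\w+", text.lower()))
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def hybrid_rescore(
    query: str,
    chunks: list[dict],
    vector_weight: float = 0.7,
) -> list[dict]:
    """Re-rank chunks by a weighted mix of vector similarity and keyword overlap."""
    scored = []
    for chunk in chunks:
        vector_similarity = 1.0 - chunk["distance"]  # cosine distance -> similarity
        score = (
            vector_weight * vector_similarity
            + (1.0 - vector_weight) * keyword_overlap(query, chunk["text"])
        )
        scored.append({**chunk, "score": score})
    return sorted(scored, key=lambda c: c["score"], reverse=True)


# Invented candidates: the second is slightly farther by vector distance
# but contains the exact error code from the query
candidates = [
    {"text": "General troubleshooting steps for network issues.", "distance": 0.20},
    {"text": "Error E4021 means the license key has expired.", "distance": 0.30},
]
ranked = hybrid_rescore("What does error E4021 mean?", candidates)
print(ranked[0]["text"])
```

The chunk containing the exact error code moves to the top even though its vector distance is larger, which is the behavior hybrid search is meant to provide. The 70/30 split is a starting point to tune, not a fixed rule.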

Initial retrieval ranks chunks by embedding similarity — a rough approximation. A cross-encoder reranker takes each (query, chunk) pair and produces a precise relevance score. It is more accurate but slower, so it runs only on the top 10–20 candidates from the initial retrieval.

Approach: Retrieve a larger set (e.g., 20 chunks), then rerank with a cross-encoder model like cross-encoder/ms-marco-MiniLM-L-6-v2 from sentence-transformers. The reranker processes each (query, chunk) pair and outputs a relevance score. Keep the top 3–5 chunks for the LLM. For a quick prototype, use the LLM itself as a reranker by asking it to rate relevance on a 1–10 scale — but this is slow and expensive at scale.
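The retrieve-then-rerank flow can be sketched with a pluggable scoring function. Everything here is illustrative: `rerank` and `toy_scorer` are hypothetical names, and the toy scorer merely counts shared words so the example runs without downloading a model. In a real setup, a cross-encoder's batch scoring call would be passed in as `score_pairs` instead.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    chunks: list[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int = 3,
) -> list[dict]:
    """Score each (query, chunk) pair and keep the top_k highest-scoring chunks."""
    pairs = [(query, chunk["text"]) for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return [{**chunk, "rerank_score": float(score)} for chunk, score in ranked[:top_k]]


def toy_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer: counts shared lowercase words. NOT a real cross-encoder."""
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]


candidates = [
    {"text": "pricing for the enterprise plan"},
    {"text": "how to reset a password"},
    {"text": "reset a forgotten admin password quickly"},
]
top = rerank("how do I reset a password", candidates, toy_scorer, top_k=2)
print([c["text"] for c in top])
```

Keeping the scorer as a parameter means the same function works whether the scores come from a local cross-encoder, a hosted reranking API, or an LLM asked to rate relevance.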

A single query phrasing may miss relevant chunks when the user’s vocabulary differs from the document’s language. Multi-query retrieval generates 3–5 alternative phrasings of the question (using the LLM), retrieves against each, and merges results by deduplicating on chunk text.

Approach: Send the original query to the LLM with a prompt like “Generate 3 alternative phrasings of this question.” Retrieve top-3 for each variant. Deduplicate by chunk content. This broadens recall, at the cost of one extra LLM call and one additional retrieval per variant.
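The merge-and-deduplicate step can be sketched as follows. The function `multi_query_retrieve` is a hypothetical helper, the query variants are hard-coded rather than generated by an LLM, and a dictionary of canned results stands in for real vector-store lookups.

```python
from typing import Callable


def multi_query_retrieve(
    queries: list[str],
    retrieve_fn: Callable[[str], list[dict]],
) -> list[dict]:
    """Retrieve for each query variant and merge, deduplicating on chunk text.

    When the same chunk is returned for several variants, the copy with the
    smallest distance is kept.
    """
    best_by_text: dict[str, dict] = {}
    for q in queries:
        for chunk in retrieve_fn(q):
            existing = best_by_text.get(chunk["text"])
            if existing is None or chunk["distance"] < existing["distance"]:
                best_by_text[chunk["text"]] = chunk
    return sorted(best_by_text.values(), key=lambda c: c["distance"])


# Canned results standing in for real vector-store lookups
FAKE_INDEX = {
    "How do I get a refund?": [
        {"text": "Refunds are issued within 30 days.", "distance": 0.25},
        {"text": "Contact support by email.", "distance": 0.60},
    ],
    "What is the return policy?": [
        {"text": "Refunds are issued within 30 days.", "distance": 0.18},
        {"text": "Items must be unused to be returned.", "distance": 0.30},
    ],
}

merged = multi_query_retrieve(list(FAKE_INDEX), lambda q: FAKE_INDEX[q])
print([(c["text"], c["distance"]) for c in merged])
```

The second phrasing surfaces a chunk the first one missed, and the duplicated chunk appears once with its best distance. In the tutorial's pipeline, `retrieve_fn` would wrap retrieve() with the collection bound in.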

These three enhancements address the three most common RAG failure modes: keyword mismatch, imprecise ranking, and query-document vocabulary gap. Apply them incrementally based on where your pipeline underperforms.


The pipeline you built in this tutorial is custom RAG — every component is explicit code you control. Framework RAG (LangChain, LlamaIndex) abstracts these components behind high-level APIs. Both approaches have clear trade-offs.

Framework RAG vs Custom RAG

Framework RAG (LangChain, LlamaIndex: high-level abstractions)

Strengths:
  • Build a working RAG pipeline in 10 lines of code
  • Pre-built integrations for 50+ vector stores and LLMs
  • Community recipes for common patterns like hybrid search

Weaknesses:
  • Abstraction hides retrieval details, making quality issues hard to debug
  • Framework updates can break your pipeline silently
  • Performance overhead from abstraction layers adds latency
  • Lock-in to framework-specific patterns and data structures

Custom RAG (direct API calls: explicit control over every component)

Strengths:
  • Full visibility into every step, so retrieval failures are easy to debug
  • No framework overhead, so your own code adds minimal latency
  • Swap any component without framework migration
  • Deep understanding of RAG internals for interviews and design

Weaknesses:
  • More code to write and maintain for each integration
  • You build common patterns (retry, batching) yourself
  • No community ecosystem for pre-built retrievers or chains

Verdict: Build custom first to understand the internals. Move to a framework when you need its integrations. Most production teams use frameworks for orchestration but customize the retrieval and generation layers.

  • Use Framework RAG when: rapid prototyping, multi-provider pipelines, teams already using the framework ecosystem
  • Use Custom RAG when: learning RAG, performance-critical systems, teams that need full control over retrieval quality

The recommendation: start custom (which you just did), then adopt a framework when the integration benefits outweigh the abstraction costs. Many production systems are hybrid — framework orchestration with custom retrieval logic.


RAG pipeline implementation is one of the most frequently tested topics in GenAI engineering interviews. These questions test whether you can build, not just describe.

Q1: Walk me through building a RAG pipeline from scratch.

What interviewers want: A clear, ordered explanation of the six stages — load, chunk, embed, store, retrieve, generate — with specific technology choices and justification for each.

Strong answer structure: Start with the two-phase architecture (offline indexing, online retrieval). For each stage, name the specific tool (PyPDF2, tiktoken, text-embedding-3-small, ChromaDB), explain the key configuration choice (chunk size 500, overlap 50, cosine similarity, top-5 retrieval, temperature 0), and state why that choice matters. End with the system prompt design — how you instruct the LLM to stay grounded and cite sources.

Q2: Your RAG pipeline retrieves chunks but the answer is wrong. How do you debug it?

What interviewers want: A systematic debugging approach, not guesswork.

Strong answer: Inspect the retrieved chunks first. If the relevant information is not in the retrieved chunks, the problem is retrieval — adjust chunk size, overlap, or embedding model. If the relevant information is in the chunks but the answer ignores it, the problem is generation — adjust the system prompt, reduce the number of chunks (noise), or lower temperature. If the information is partially in multiple chunks that were not all retrieved, the problem is fragmentation — increase overlap or switch to hierarchical chunking.

Q3: How would you add hybrid search to this pipeline?

What interviewers want: Understanding that vector search alone has failure modes, and practical knowledge of how to combine semantic and keyword matching.

Strong answer: Vector search excels at semantic matching but misses exact terms — product names, error codes, acronyms. Add BM25 or keyword filtering alongside vector search. Merge results using reciprocal rank fusion: for each document, combine its rank from both methods. This captures both semantic relevance and keyword precision. ChromaDB supports basic keyword filtering via where_document; for full BM25, add Elasticsearch or use a database that supports both natively like Weaviate.
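Reciprocal rank fusion is short enough to sketch in full. This is a minimal version under the usual formulation, where each document scores the sum of 1 / (k + rank) across the lists it appears in; the constant k = 60 is a commonly used default, and the chunk IDs and rankings below are invented.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["chunk_3", "chunk_1", "chunk_7"]   # from vector search
keyword_ranking = ["chunk_1", "chunk_9", "chunk_3"]  # from BM25 or keyword search

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print([doc_id for doc_id, _ in fused])
```

Because the method uses ranks rather than raw scores, it avoids having to normalize cosine distances against BM25 scores, which live on different scales.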

Q4: What happens if your documents change frequently? How do you keep the RAG pipeline current?

What interviewers want: Awareness of the operational challenge of maintaining a RAG system beyond the initial build.

Strong answer: Track document versions with content hashes. When a document changes, re-chunk and re-embed only the changed document — do not re-index the entire corpus. Use ChromaDB’s upsert to replace stale chunks. For large-scale systems, build an incremental indexing pipeline that watches for file changes (or database update timestamps) and processes only deltas. Set up monitoring to track embedding freshness — the gap between the most recent document update and the most recent re-index.
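Change detection with content hashes can be sketched in a few lines. The helper names and the in-memory `known` dictionary are assumptions for illustration; a real system would persist the hashes alongside the index.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_changed(documents: list[dict], known_hashes: dict[str, str]) -> list[dict]:
    """Return documents that are new or whose text differs from the stored hash."""
    changed = []
    for doc in documents:
        if known_hashes.get(doc["source"]) != content_hash(doc["text"]):
            changed.append(doc)
    return changed


# Hashes recorded at the previous indexing run (hypothetical state)
known = {
    "a.txt": content_hash("alpha"),
    "b.txt": content_hash("beta"),
}
current_docs = [
    {"source": "a.txt", "text": "alpha"},          # unchanged
    {"source": "b.txt", "text": "beta, revised"},  # edited
    {"source": "c.txt", "text": "gamma"},          # new
]

to_reindex = find_changed(current_docs, known)
print([d["source"] for d in to_reindex])
```

Only the edited and new documents need re-chunking and re-embedding. For upserts to replace the right rows, chunk IDs would need to be stable per document (for example, built from the source path plus the chunk index), and chunks left over from a longer previous version would still have to be deleted.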


The pipeline you built works for learning and prototyping. Moving it to production requires attention to three areas: scaling, evaluation, and monitoring.

The in-memory ChromaDB client works for thousands of chunks. For millions, switch to a managed vector database:

Scale | Recommended Approach
<10K chunks | ChromaDB in-memory or persistent local
10K–1M chunks | ChromaDB persistent, Qdrant local, or FAISS
1M+ chunks | Pinecone, Qdrant Cloud, or Weaviate Cloud

Embedding generation is the slowest part of indexing. For large document sets, use batch embedding with the approach shown in Step 3 (100 chunks per API call). For very large corpora (>100K documents), consider local embedding models to eliminate API costs and rate limits.

RAG evaluation measures four dimensions. Without measurement, you cannot systematically improve quality.

Metric | What It Measures | How to Compute
Context Precision | Are the retrieved chunks relevant to the query? | LLM-as-judge scores each chunk’s relevance
Context Recall | Did retrieval find all relevant chunks? | Compare retrieved chunks against a labeled ground truth set
Faithfulness | Does the answer stay grounded in the retrieved context? | LLM-as-judge checks each claim against the provided chunks
Answer Relevancy | Does the answer address the original question? | LLM-as-judge scores answer-question alignment

Start with a test set of 20–50 (question, expected answer, source document) triples. Run your pipeline on each question. Score each metric. This gives you a baseline. Every change to chunking, retrieval, or prompting should improve at least one metric without degrading the others.
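Of the four metrics, context recall is the easiest to compute without an LLM judge. The sketch below measures it at the source-document level, which is a simplification of chunk-level recall; the function name, test cases, and retrieved sources are all invented for illustration.

```python
def context_recall(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """Fraction of expected sources that appear among the retrieved ones."""
    expected = set(expected_sources)
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    return len(expected & set(retrieved_sources)) / len(expected)


# Tiny hand-labeled test set; retrieved sources are hard-coded for illustration
test_cases = [
    {
        "question": "What is the refund window?",
        "expected": ["policy.txt"],
        "retrieved": ["policy.txt", "faq.txt"],
    },
    {
        "question": "Who founded the company?",
        "expected": ["about.txt", "history.txt"],
        "retrieved": ["about.txt", "faq.txt"],
    },
]

scores = [context_recall(c["retrieved"], c["expected"]) for c in test_cases]
mean_recall = sum(scores) / len(scores)
print(scores, mean_recall)
```

Re-running a script like this after every change to chunking or retrieval parameters gives a baseline to compare against. In practice, the retrieved sources would come from calling retrieve() on each question rather than from hard-coded lists.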

Production RAG systems need three types of monitoring:

  1. Retrieval quality: log the query, retrieved chunks, and distances for every request. Flag queries where the top chunk’s cosine distance exceeds a threshold (e.g., >0.5), since these indicate likely retrieval failures.
  2. Latency — measure embedding time, search time, and generation time separately. Total latency should be under 3 seconds for a good user experience. If embedding dominates, batch or cache queries. If generation dominates, use a faster model or enable streaming.
  3. Index freshness — track the timestamp of the most recently indexed document. If documents are updated daily but your index is a week old, users get stale answers.
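Per-stage latency measurement can be sketched with a small context manager. The `timed` helper and the module-level `timings` dictionary are hypothetical names, and time.sleep stands in for the real embedding, search, and generation calls.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record wall-clock seconds spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# time.sleep stands in for the real embedding, search, and generation calls
with timed("embed"):
    time.sleep(0.01)
with timed("search"):
    time.sleep(0.01)
with timed("generate"):
    time.sleep(0.02)

total = sum(timings.values())
print({stage: round(seconds, 3) for stage, seconds in timings.items()}, round(total, 3))
```

Logging the three numbers separately shows which stage dominates before you decide whether to cache query embeddings, tune the vector index, or switch to a faster model.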

You built a complete RAG pipeline in Python: six functions, no framework dependencies, and a working system that answers questions about your documents. The pipeline loads documents, chunks them with overlap, generates embeddings, stores vectors in ChromaDB, retrieves relevant chunks by similarity, and generates grounded answers.

Key takeaways:

  • RAG has two phases: offline indexing (load, chunk, embed, store) and online query (retrieve, generate)
  • Chunking parameters (size and overlap) have more impact on quality than most engineers expect
  • The system prompt is the primary control for grounding — it tells the LLM to use only the retrieved context
  • Hybrid search, reranking, and multi-query retrieval address the three most common retrieval failure modes
  • Evaluation requires a test set and four metrics: precision, recall, faithfulness, and relevancy

Last updated: March 2026. Code examples use OpenAI Python SDK v1.x, ChromaDB v0.5.x, and PyPDF2 v3.x. The pipeline architecture applies regardless of which specific libraries or models you choose.

Frequently Asked Questions

What libraries do I need to build a RAG pipeline in Python?

A minimal RAG pipeline needs four libraries: openai for embeddings and LLM generation, chromadb for vector storage, PyPDF2 for document loading, and tiktoken for token counting. Install with pip install openai chromadb PyPDF2 tiktoken. For production systems, you may add LangChain for orchestration or sentence-transformers for local embedding models.

How do I choose the right chunk size for my RAG pipeline?

Start with 500-token chunks and 50-token overlap. Smaller chunks (200-300 tokens) improve retrieval precision but lose surrounding context. Larger chunks (800-1000 tokens) preserve context but dilute relevance signals. Test with your actual documents and queries — measure retrieval quality by checking whether the returned chunks contain the answer to your test questions.

Why does my RAG pipeline return irrelevant results?

The most common causes are: chunks that are too large (diluting the relevance signal), no overlap between chunks (splitting key information across boundaries), poor query formulation (the query phrasing does not match the document language), or insufficient top-k results. Try reducing chunk size, increasing overlap, rephrasing queries, or adding hybrid search with keyword matching alongside vector search.

Can I build a RAG pipeline without OpenAI?

Yes. Replace OpenAI embeddings with sentence-transformers running locally (e.g., all-MiniLM-L6-v2) and replace the OpenAI LLM with Ollama running a local model like Llama 3 or Mistral. ChromaDB works with any embedding model. The pipeline architecture stays identical — only the embedding and generation functions change.

How is this tutorial different from the RAG architecture guide?

The RAG architecture guide covers design concepts: what RAG is, architectural patterns, chunking strategies, hybrid search theory, and production failure modes. This tutorial is a hands-on Python implementation. You write and run every line of code: loading documents, chunking text, generating embeddings, storing vectors in ChromaDB, retrieving relevant chunks, and generating answers. Start here to build, then read the architecture guide to design production systems.