RAG Pipeline Tutorial — Build a Working RAG System in Python (2026)

Q: What libraries do I need to build a RAG pipeline in Python?

A minimal RAG pipeline needs four libraries: openai for embeddings and LLM generation, chromadb for vector storage, PyPDF2 or pdfplumber for document loading, and tiktoken for token counting. Install them with pip install openai chromadb PyPDF2 tiktoken. For production systems, you may add langchain for orchestration and sentence-transformers for local embedding models.

Q: Can I build a RAG pipeline without OpenAI?

Yes. Replace OpenAI embeddings with sentence-transformers running locally (e.g., all-MiniLM-L6-v2) and replace the OpenAI LLM with Ollama running a local model like Llama 3 or Mistral. ChromaDB works with any embedding model. The pipeline architecture stays identical; only the embedding and generation functions change. Local models carry no API cost, but LLM generation generally needs a GPU for reasonable speed, whereas a small embedding model such as all-MiniLM-L6-v2 runs acceptably on CPU.

Q: How is this tutorial different from the RAG architecture guide?

The RAG architecture guide at /genai-engineer/rag/ covers design concepts: what RAG is, architectural patterns, chunking strategies, hybrid search theory, and production failure modes. This tutorial is a hands-on Python implementation. You write and run every line of code: loading documents, chunking text, generating embeddings, storing vectors in ChromaDB, retrieving relevant chunks, and generating answers. Start here to build, then read the architecture guide to design production systems.

This tutorial walks you through building a complete RAG pipeline in Python — from loading raw documents to generating grounded answers. Every code block runs. By the end, you will have a working system that answers questions about your own documents.

Prerequisites: Python 3.10+, an OpenAI API key, and basic familiarity with Python. No prior RAG experience required.

1. Why Build a RAG Pipeline from Scratch

Frameworks like LangChain and LlamaIndex abstract the RAG pipeline into a few function calls. That abstraction is useful in production — but it hides the mechanics. When retrieval quality is poor, when answers hallucinate, when latency is unacceptable, you need to understand what is happening at each stage to diagnose and fix the problem.

Building a RAG pipeline from scratch teaches you:

What each component does — not conceptually, but in working code you can step through
Where quality bottlenecks live — chunking decisions, embedding choices, retrieval parameters
How to debug retrieval failures — why a relevant document was not retrieved, why an irrelevant one was
What frameworks abstract away — so you know when the abstraction helps and when it hinders

This is not an exercise in reinventing the wheel. It is an exercise in understanding the wheel so you can build better vehicles. After completing this tutorial, framework-based RAG code will make significantly more sense because you will have implemented every layer yourself.

The entire pipeline fits in roughly 250 lines of Python, docstrings and comments included. No framework magic. No hidden complexity.

2. What You Will Build

The RAG pipeline you build in this tutorial has six stages. Each stage is a discrete function you can inspect, modify, and test independently.

Pipeline overview:

1. Load Documents    → Read PDF or text files into raw strings
2. Chunk Text        → Split documents into overlapping segments (500 tokens, 50-token overlap)
3. Generate Embeddings → Convert each chunk into a vector using OpenAI's embedding model
4. Store in Vector DB  → Index vectors in ChromaDB for fast similarity search
5. Retrieve Chunks     → Given a user query, find the most relevant chunks
6. Generate Answer     → Pass retrieved chunks + query to an LLM for a grounded response

What you will need:

pip install openai chromadb PyPDF2 tiktoken

openai — embedding generation and LLM calls
chromadb — local vector database (no server required)
PyPDF2 — PDF document loading
tiktoken — accurate token counting for chunk sizing

Set your API key as an environment variable before running any code:

export OPENAI_API_KEY="your-api-key-here"

The total cost to run this tutorial is under $0.05 in OpenAI API calls. Embedding a few documents and making a handful of LLM calls uses minimal tokens.

3. RAG Pipeline Architecture

Before writing code, understand the data flow. A RAG pipeline has two phases: an offline indexing phase (steps 1–4, run once per document set) and an online query phase (steps 5–6, run on every user question).

Data Flow Diagram

RAG Pipeline: End-to-End Data Flow

The indexing phase runs once per document set. The query phase runs on every user question. Both connect through the vector store.

Indexing Phase (run once when documents change)
Load Documents → Chunk Text → Generate Embeddings → Store in Vector DB

Query Phase (run on every user question)
Embed User Query → Similarity Search → Retrieve Top-K Chunks → Assemble Context

Generation Phase (LLM produces grounded answer)
Build Prompt → Inject Retrieved Context → LLM Generation → Return Answer

The indexing phase transforms raw documents into searchable vectors. The query phase converts the user’s question into a vector, finds the closest document chunks, and passes them as context to the LLM. The LLM generates an answer grounded in the retrieved content — not from its training data alone.

4. Build the RAG Pipeline Step by Step

This is the core of the tutorial. Each step is a standalone function with a clear input and output. Together, they form the complete pipeline.

Step 1: Load Documents

Load text from PDF files or plain text files. This function handles both formats and returns a list of dicts, each holding a document's text and its source path.

from PyPDF2 import PdfReader
from pathlib import Path

def load_documents(file_paths: list[str]) -> list[dict]:
    """Load documents from PDF or text files.

    Returns a list of dicts with 'text' and 'source' keys.
    """
    documents = []
    for path in file_paths:
        file_path = Path(path)
        suffix = file_path.suffix.lower()  # Match ".PDF" as well as ".pdf"
        if suffix == ".pdf":
            reader = PdfReader(str(file_path))
            text = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
        elif suffix in (".txt", ".md"):
            text = file_path.read_text(encoding="utf-8")
        else:
            print(f"Skipping unsupported file: {path}")
            continue

        if text.strip():
            documents.append({"text": text, "source": str(file_path)})

    return documents

Key decisions:

Each document carries its source metadata — this enables citation in the final answer
Pages with no extractable text contribute an empty string, and a document with no text at all is skipped (common in scanned PDFs with OCR failures)
The function is format-agnostic — add more file types by extending the if/elif chain

Step 2: Chunk Text with Overlap

Chunking splits documents into segments small enough for embedding but large enough to preserve context. Overlap reduces the risk that a sentence or idea is cut in half at a chunk boundary and lost to retrieval.

import tiktoken

def chunk_text(
    text: str,
    source: str,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[dict]:
    """Split text into overlapping chunks based on token count.

    Returns a list of dicts with 'text', 'source', and 'chunk_index' keys.
    """
    encoder = tiktoken.encoding_for_model("gpt-4o-mini")
    tokens = encoder.encode(text)
    chunks = []
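    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    start = 0

    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        # Named 'piece' so the local variable does not shadow chunk_text()
        piece = encoder.decode(chunk_tokens)

        chunks.append({
            "text": piece,
            "source": source,
            "chunk_index": len(chunks),
        })

        # Stop once this window reaches the end of the text; otherwise the
        # next window would only repeat the overlap tail
        if end >= len(tokens):
            break

        # Move forward by chunk_size minus overlap
        start += chunk_size - overlap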

    return chunks

def chunk_documents(documents: list[dict], **kwargs) -> list[dict]:
    """Chunk all documents and return a flat list of chunks."""
    all_chunks = []
    for doc in documents:
        doc_chunks = chunk_text(doc["text"], doc["source"], **kwargs)
        all_chunks.extend(doc_chunks)
    return all_chunks

Why 500 tokens with 50-token overlap?

500 tokens is large enough to contain a complete paragraph or concept, but small enough that the embedding vector captures a focused topic. Larger chunks dilute the relevance signal. Smaller chunks lose context.
50-token overlap means that a sentence of up to 50 tokens that straddles a chunk boundary appears in full in at least one of the two chunks. Longer sentences can still be split, but without any overlap, information at chunk boundaries is much more likely to be invisible to retrieval.

These defaults work well for most documents. Adjust them based on your specific content — shorter for FAQ-style documents, longer for dense technical writing.
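To see how chunk size and overlap interact, it can help to compute the window boundaries directly. The sketch below is illustrative only: `chunk_spans` is a hypothetical helper (not part of the pipeline) that works on token counts rather than real tokens, and it assumes that a window which already reaches the end of the text should be the last one.

```python
def chunk_spans(
    n_tokens: int,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for each overlapping chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # 450 tokens with the defaults
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # this window already covers the tail
            break
        start += stride
    return spans


# A 1,200-token document: each chunk starts 450 tokens after the previous
# one, so consecutive chunks share 50 tokens.
for start, end in chunk_spans(1200):
    print(start, end)
```

Changing `chunk_size` or `overlap` here shows quickly how many chunks (and therefore how many embeddings) a document of a given length will produce.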

Step 3: Generate Embeddings

Convert each text chunk into a numerical vector using OpenAI’s embedding model. These vectors capture semantic meaning — chunks about similar topics produce similar vectors.

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var

def generate_embeddings(
    chunks: list[dict],
    model: str = "text-embedding-3-small",
) -> list[dict]:
    """Add an 'embedding' key to each chunk dict.

    Sends chunks in batches to the OpenAI embedding API.
    """
    batch_size = 100  # API supports up to 2048 inputs per call
    texts = [chunk["text"] for chunk in chunks]

    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        response = client.embeddings.create(
            model=model,
            input=batch,
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)

    # Attach embeddings to chunk dicts
    for chunk, embedding in zip(chunks, all_embeddings):
        chunk["embedding"] = embedding

    return chunks

Why text-embedding-3-small?

It produces 1536-dimensional vectors with strong semantic quality
It costs $0.02 per million tokens — embedding an entire book costs pennies
For higher quality at roughly 6.5x the cost ($0.13 per million tokens at OpenAI's published launch pricing), use text-embedding-3-large (3072 dimensions)
For zero cost, use a local model like all-MiniLM-L6-v2 via sentence-transformers (384 dimensions, runs on CPU)
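The pricing and dimension figures above reduce to simple arithmetic. The following is a rough estimator, not a billing tool: both function names are hypothetical, the corpus size is a made-up example, and the storage figure assumes 4-byte floats with no index overhead.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float = 0.02) -> float:
    """Estimated API cost of embedding `total_tokens` tokens."""
    return total_tokens / 1_000_000 * price_per_million


def vector_storage_bytes(
    n_chunks: int,
    dimensions: int = 1536,
    bytes_per_value: int = 4,
) -> int:
    """Raw size of the vectors alone (assumes 4-byte floats, no index overhead)."""
    return n_chunks * dimensions * bytes_per_value


# Assumption for illustration: 150,000 tokens of text stored as 334 chunks.
print(f"cost:    ${embedding_cost_usd(150_000):.4f}")
print(f"vectors: {vector_storage_bytes(334) / 1_000_000:.2f} MB")
```

Pass a different `price_per_million` or `dimensions` to compare embedding models before committing to one.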

Step 4: Store in ChromaDB

ChromaDB is an open-source vector database that runs locally — no server setup, no cloud account. It stores vectors alongside metadata and supports fast similarity search.

import chromadb

def create_vector_store(
    chunks: list[dict],
    collection_name: str = "rag_tutorial",
) -> chromadb.Collection:
    """Store chunks with embeddings in a ChromaDB collection.

    Returns the collection for later querying.
    """
    chroma_client = chromadb.Client()  # In-memory — use PersistentClient for disk

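    # Delete existing collection if it exists (clean re-index)
    try:
        chroma_client.delete_collection(collection_name)
    except Exception:
        # ChromaDB 0.5.x raises ValueError for a missing collection;
        # other releases may raise a different error type
        pass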

    collection = chroma_client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # Cosine similarity
    )

    collection.add(
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        embeddings=[chunk["embedding"] for chunk in chunks],
        documents=[chunk["text"] for chunk in chunks],
        metadatas=[
            {"source": chunk["source"], "chunk_index": chunk["chunk_index"]}
            for chunk in chunks
        ],
    )

    return collection

Key configuration:

hnsw:space: "cosine" — cosine similarity is the standard distance metric for OpenAI embeddings. Documents with similar meaning produce vectors that point in similar directions.
chromadb.Client() — in-memory storage. For persistence across sessions, use chromadb.PersistentClient(path="./chroma_db").
Metadata is stored alongside vectors — this enables filtering by source, date, or any custom field during retrieval.
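With the cosine space selected, the distances ChromaDB reports are cosine distances, that is, 1 minus cosine similarity. A small pure-Python version makes the numbers easier to interpret. This is a sketch for intuition, not how the vector index computes them internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """0 = same direction, 1 = orthogonal, 2 = opposite direction."""
    return 1.0 - cosine_similarity(a, b)


# Vector length does not matter, only direction does.
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```

Because only direction matters, a long chunk and a short chunk about the same topic can still score as close neighbours.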

Step 5: Retrieve Relevant Chunks

Given a user query, embed it with the same model and find the closest chunks by cosine similarity.

def retrieve(
    query: str,
    collection: chromadb.Collection,
    n_results: int = 5,
) -> list[dict]:
    """Retrieve the most relevant chunks for a query.

    Returns chunks ranked by cosine similarity to the query embedding.
    """
    # Embed the query using the same model
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    query_embedding = query_response.data[0].embedding

    # Search the vector store
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )

    # Format results
    retrieved_chunks = []
    for i in range(len(results["documents"][0])):
        retrieved_chunks.append({
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i],
        })

    return retrieved_chunks

Why n_results=5?

Fewer results (1–2) risk missing relevant context. More results (10+) add noise that can confuse the LLM.
5 is a practical default. After building the pipeline, tune this number based on your specific documents and queries.
The distance field is the cosine distance between the query and each chunk (lower means more similar), which is useful for debugging and for setting relevance thresholds.
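One way to use the distance field is a relevance cutoff applied to the list that retrieve() returns. The helper below is a hypothetical addition, and the 0.5 default is only a starting point to tune against your own data.

```python
def filter_by_distance(
    retrieved_chunks: list[dict],
    max_distance: float = 0.5,
) -> list[dict]:
    """Keep only chunks whose cosine distance is at or below the cutoff."""
    return [c for c in retrieved_chunks if c["distance"] <= max_distance]


# Sample data in the same shape that retrieve() produces.
sample = [
    {"text": "Refunds are issued within 14 days.", "source": "policy.pdf", "distance": 0.21},
    {"text": "Our office is closed on holidays.", "source": "policy.pdf", "distance": 0.63},
]
kept = filter_by_distance(sample)
print(f"kept {len(kept)} of {len(sample)} chunks")
```

If the filtered list comes back empty, the pipeline can skip the LLM call and report that no relevant context was found.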

Step 6: Generate an Answer with Context

Assemble the retrieved chunks into a prompt and send it to the LLM. The system prompt instructs the model to answer only from the provided context.

def generate_answer(
    query: str,
    retrieved_chunks: list[dict],
    model: str = "gpt-4o-mini",
) -> str:
    """Generate an answer grounded in the retrieved context.

    The LLM is instructed to use only the provided context.
    """
    # Build context from retrieved chunks
    context_parts = []
    for i, chunk in enumerate(retrieved_chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['source']}]\n{chunk['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    system_prompt = (
        "You are a helpful assistant that answers questions based on "
        "the provided context. Use only the information in the context "
        "to answer. If the context does not contain enough information "
        "to answer the question, say so explicitly. Cite the source "
        "number (e.g., [Source 1]) when using information from a "
        "specific passage."
    )

    user_prompt = (
        f"Context:\n{context}\n\n---\n\n"
        f"Question: {query}\n\n"
        f"Answer based on the context above:"
    )

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )

    return response.choices[0].message.content

Key prompt engineering decisions:

temperature=0: near-deterministic output (small variations between runs are still possible). For RAG, you want the LLM to faithfully report what is in the context, not creatively interpret it.
Source citation instruction: the LLM is told to cite [Source N], linking answers back to specific retrieved chunks. This is basic grounding.
Explicit refusal instruction: “if the context does not contain enough information, say so” reduces the risk of the LLM hallucinating when retrieval fails.

Putting It All Together

Here is the orchestration function that ties the six steps together. Place it in the same file as the functions above to get a single runnable script:

def run_rag_pipeline(
    file_paths: list[str],
    query: str,
) -> str:
    """Run the complete RAG pipeline end to end."""
    # Phase 1: Indexing
    print("Loading documents...")
    documents = load_documents(file_paths)
    print(f"Loaded {len(documents)} documents")

    print("Chunking...")
    chunks = chunk_documents(documents)
    print(f"Created {len(chunks)} chunks")

    print("Generating embeddings...")
    chunks = generate_embeddings(chunks)
    print("Embeddings generated")

    print("Storing in vector database...")
    collection = create_vector_store(chunks)
    print(f"Stored {collection.count()} chunks in ChromaDB")

    # Phase 2: Query
    print(f"\nQuery: {query}")
    print("Retrieving relevant chunks...")
    retrieved = retrieve(query, collection)
    print(f"Retrieved {len(retrieved)} chunks")

    print("Generating answer...")
    answer = generate_answer(query, retrieved)

    return answer

# Run it
if __name__ == "__main__":
    answer = run_rag_pipeline(
        file_paths=["your-document.pdf"],
        query="What are the main points of this document?",
    )
    print(f"\nAnswer:\n{answer}")

Run this against any PDF or text file on your machine. Replace the file path and query with your own. The output includes the generated answer with source citations.

5. RAG Component Stack

Each layer of the RAG pipeline has a specific responsibility. Understanding this stack helps you swap components — replace ChromaDB with Pinecone, replace OpenAI embeddings with a local model, or add a reranking layer.

RAG Pipeline Component Stack

Each layer can be swapped independently. The interfaces between layers stay the same.

LLM Generation Layer: GPT-4o-mini, Claude, Llama 3 -- produces grounded answers from context
Prompt Assembly Layer: System prompt + retrieved chunks + user query -- controls LLM behavior
Retrieval Layer: Vector similarity search -- finds the most relevant chunks for a query
Vector Storage Layer: ChromaDB, Pinecone, Qdrant -- indexes and serves embedding vectors
Embedding Layer: text-embedding-3-small, all-MiniLM-L6-v2 -- converts text to vectors
Document Processing Layer: Loading, chunking, metadata extraction -- prepares raw documents

The stack diagram shows why RAG is modular. You can upgrade any layer without rewriting the others. Want better retrieval? Add a reranking layer between retrieval and prompt assembly. Want lower cost? Swap OpenAI embeddings for a local model. Want persistence? Replace the in-memory ChromaDB client with a persistent one.

6. RAG Pipeline Enhancements

The basic pipeline works, but production RAG systems add three enhancements that measurably improve answer quality.

Enhancement 1: Hybrid Search

Vector search finds semantically similar content but misses exact keyword matches: product names, error codes, version numbers. Hybrid search combines vector similarity with keyword matching.

Approach: Run vector search to get candidates, then re-score by combining cosine similarity (70% weight) with keyword overlap (30% weight). ChromaDB supports where_document filters for basic keyword matching alongside vector search. For full BM25 keyword scoring, add Elasticsearch or use Weaviate which supports both natively.

# Core idea — combine scores from both retrieval methods
combined_score = 0.7 * vector_similarity + 0.3 * keyword_overlap
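A runnable version of that idea might look like the sketch below. It assumes chunks in the shape retrieve() returns, converts cosine distance to similarity as 1 minus distance, and uses a deliberately crude keyword score (the fraction of query terms found in the chunk). `tokenize`, `keyword_overlap`, and `hybrid_rescore` are hypothetical names, and this is not BM25.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric terms, so 'E4012' and 'e4012' match."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear in the text."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(text)) / len(query_terms)


def hybrid_rescore(
    query: str,
    retrieved_chunks: list[dict],
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> list[dict]:
    """Re-rank chunks by a weighted mix of vector and keyword scores."""
    rescored = []
    for chunk in retrieved_chunks:
        vector_similarity = 1.0 - chunk["distance"]  # cosine distance -> similarity
        score = (
            vector_weight * vector_similarity
            + keyword_weight * keyword_overlap(query, chunk["text"])
        )
        rescored.append({**chunk, "hybrid_score": score})
    return sorted(rescored, key=lambda c: c["hybrid_score"], reverse=True)


candidates = [
    {"text": "Network requests can fail for many reasons.", "distance": 0.30},
    {"text": "Error E4012 indicates a gateway timeout.", "distance": 0.40},
]
ranked = hybrid_rescore("error E4012 timeout", candidates)
print(ranked[0]["text"])
```

In this example the second chunk is slightly further away in vector space, but its exact match on the error code lifts it above the first once the keyword score is added.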

Enhancement 2: Reranking

Initial retrieval ranks chunks by embedding similarity — a rough approximation. A cross-encoder reranker takes each (query, chunk) pair and produces a precise relevance score. It is more accurate but slower, so it runs only on the top 10–20 candidates from the initial retrieval.

Approach: Retrieve a larger set (e.g., 20 chunks), then rerank with a cross-encoder model like cross-encoder/ms-marco-MiniLM-L-6-v2 from sentence-transformers. The reranker processes each (query, chunk) pair and outputs a relevance score. Keep the top 3–5 chunks for the LLM. For a quick prototype, use the LLM itself as a reranker by asking it to rate relevance on a 1–10 scale — but this is slow and expensive at scale.
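The retrieve-then-rerank flow can be sketched independently of any particular model. In the code below, `rerank` and `toy_overlap_scorer` are hypothetical names; the toy scorer merely counts shared words so the example runs without extra dependencies. In a real setup a cross-encoder's batch scoring method (for example `CrossEncoder.predict` in sentence-transformers, which takes a list of text pairs) would take its place.

```python
from typing import Callable

PairScorer = Callable[[list[tuple[str, str]]], list[float]]


def rerank(
    query: str,
    chunks: list[dict],
    score_pairs: PairScorer,
    top_k: int = 5,
) -> list[dict]:
    """Score every (query, chunk) pair and keep the top_k highest."""
    pairs = [(query, chunk["text"]) for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [{**chunk, "rerank_score": float(score)} for chunk, score in ranked[:top_k]]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer: number of lowercase words shared by query and text."""
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]


candidates = [
    {"text": "cats sleep a lot"},
    {"text": "how to reset your password"},
    {"text": "password rules"},
]
top = rerank("reset password", candidates, toy_overlap_scorer, top_k=2)
print([c["text"] for c in top])
```

Keeping the scorer as a parameter means the retrieval code does not change when you move from the toy scorer to a real model.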

Enhancement 3: Multi-Query Retrieval

A single query phrasing may miss relevant chunks when the user’s vocabulary differs from the document’s language. Multi-query retrieval generates 3–5 alternative phrasings of the question (using the LLM), retrieves against each, and merges results by deduplicating on chunk text.

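Approach: Send the original query to the LLM with a prompt like “Generate 3 alternative phrasings of this question.” Retrieve top-3 for each variant. Deduplicate by chunk content. This broadens recall at the cost of one extra LLM call plus one embedding call per variant. The merged set can also include weaker matches, so it pairs well with reranking.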
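The merge step is the part most easily shown in code. The sketch below assumes you already have one result list per query variant, each in the shape retrieve() returns; the LLM call that generates the variants is not shown. `merge_results` is a hypothetical helper that deduplicates on chunk text and keeps the best (lowest) distance seen for each chunk.

```python
def merge_results(result_lists: list[list[dict]]) -> list[dict]:
    """Merge per-variant results, deduplicating on chunk text."""
    best: dict[str, dict] = {}
    for results in result_lists:
        for chunk in results:
            key = chunk["text"]
            # Keep the copy with the smallest distance across all variants
            if key not in best or chunk["distance"] < best[key]["distance"]:
                best[key] = chunk
    return sorted(best.values(), key=lambda c: c["distance"])


variant_a = [
    {"text": "Refunds take 14 days.", "distance": 0.35},
    {"text": "Contact support by email.", "distance": 0.50},
]
variant_b = [
    {"text": "Refunds take 14 days.", "distance": 0.20},
    {"text": "Returns need a receipt.", "distance": 0.40},
]
merged = merge_results([variant_a, variant_b])
print([c["text"] for c in merged])
```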

These three enhancements address the three most common RAG failure modes: keyword mismatch, imprecise ranking, and query-document vocabulary gap. Apply them incrementally based on where your pipeline underperforms.

7. Framework RAG vs Custom RAG

The pipeline you built in this tutorial is custom RAG — every component is explicit code you control. Framework RAG (LangChain, LlamaIndex) abstracts these components behind high-level APIs. Both approaches have clear trade-offs.

Framework RAG vs Custom RAG

Framework RAG

LangChain, LlamaIndex: high-level abstractions

Strengths:
Build a working RAG pipeline in 10 lines of code
Pre-built integrations for 50+ vector stores and LLMs
Community recipes for common patterns like hybrid search

Limitations:
Abstraction hides retrieval details, making quality issues hard to debug
Framework updates can break your pipeline silently
Performance overhead from abstraction layers adds latency
Lock-in to framework-specific patterns and data structures

Custom RAG

Direct API calls: explicit control over every component

Strengths:
Full visibility into every step, making retrieval failures easy to debug
No framework overhead, so your code adds minimal latency
Swap any component without framework migration
Deep understanding of RAG internals for interviews and design

Limitations:
More code to write and maintain for each integration
You build common patterns (retry, batching) yourself
No community ecosystem for pre-built retrievers or chains

Verdict: Build custom first to understand the internals. Move to a framework when you need its integrations. Most production teams use frameworks for orchestration but customize the retrieval and generation layers.

Use Framework RAG when…

Rapid prototyping, multi-provider pipelines, teams already using the framework ecosystem

Use Custom RAG when…

Learning RAG, performance-critical systems, teams that need full control over retrieval quality

The recommendation: start custom (which you just did), then adopt a framework when the integration benefits outweigh the abstraction costs. Many production systems are hybrid — framework orchestration with custom retrieval logic.

8. Interview Questions

RAG pipeline implementation is one of the most frequently tested topics in GenAI engineering interviews. These questions test whether you can build, not just describe.

Q1: Walk me through building a RAG pipeline from scratch.

What interviewers want: A clear, ordered explanation of the six stages — load, chunk, embed, store, retrieve, generate — with specific technology choices and justification for each.

Strong answer structure: Start with the two-phase architecture (offline indexing, online retrieval). For each stage, name the specific tool (PyPDF2, tiktoken, text-embedding-3-small, ChromaDB), explain the key configuration choice (chunk size 500, overlap 50, cosine similarity, top-5 retrieval, temperature 0), and state why that choice matters. End with the system prompt design — how you instruct the LLM to stay grounded and cite sources.

Q2: Your RAG pipeline retrieves chunks but the answer is wrong. How do you debug it?

What interviewers want: A systematic debugging approach, not guesswork.

Strong answer: Inspect the retrieved chunks first. If the relevant information is not in the retrieved chunks, the problem is retrieval — adjust chunk size, overlap, or embedding model. If the relevant information is in the chunks but the answer ignores it, the problem is generation — adjust the system prompt, reduce the number of chunks (noise), or lower temperature. If the information is partially in multiple chunks that were not all retrieved, the problem is fragmentation — increase overlap or switch to hierarchical chunking.

Q3: How would you add hybrid search to this pipeline?

What interviewers want: Understanding that vector search alone has failure modes, and practical knowledge of how to combine semantic and keyword matching.

Strong answer: Vector search excels at semantic matching but misses exact terms — product names, error codes, acronyms. Add BM25 or keyword filtering alongside vector search. Merge results using reciprocal rank fusion: for each document, combine its rank from both methods. This captures both semantic relevance and keyword precision. ChromaDB supports basic keyword filtering via where_document; for full BM25, add Elasticsearch or use a database that supports both natively like Weaviate.
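Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retrieval method returns an ordered list of chunk IDs, best first. The constant k=60 is a commonly used default rather than anything this pipeline requires, and `reciprocal_rank_fusion` is a hypothetical helper name.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["c1", "c2", "c3"]    # from vector search
keyword_ranking = ["c3", "c1", "c4"]   # from keyword or BM25 search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print([doc_id for doc_id, _ in fused])
```

Chunks that appear in both lists accumulate two contributions, so they rise above chunks that only one method found. Because the method uses ranks rather than raw scores, the two retrievers do not need comparable score scales.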

Q4: What happens if your documents change frequently? How do you keep the RAG pipeline current?

What interviewers want: Awareness of the operational challenge of maintaining a RAG system beyond the initial build.

Strong answer: Track document versions with content hashes. When a document changes, re-chunk and re-embed only the changed document — do not re-index the entire corpus. Use ChromaDB’s upsert to replace stale chunks. For large-scale systems, build an incremental indexing pipeline that watches for file changes (or database update timestamps) and processes only deltas. Set up monitoring to track embedding freshness — the gap between the most recent document update and the most recent re-index.
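A minimal sketch of the hash-tracking part, under some assumptions: documents have the shape load_documents() returns, `known_hashes` is a hypothetical mapping of source path to the hash recorded at the last index run, and chunk IDs are derived from the source so that one document's chunks can be replaced without touching the rest. The tutorial's positional `chunk_{i}` IDs would not allow that.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_changed(documents: list[dict], known_hashes: dict[str, str]) -> list[dict]:
    """Return documents that are new or whose text differs from the last run."""
    return [
        doc for doc in documents
        if known_hashes.get(doc["source"]) != content_hash(doc["text"])
    ]


def chunk_id(source: str, chunk_index: int) -> str:
    """Per-document chunk ID, so re-indexing one file overwrites its own chunks."""
    return f"{source}::chunk_{chunk_index}"


current_docs = [
    {"text": "Pricing: $10 per seat.", "source": "pricing.md"},
    {"text": "Support hours: 9 to 5.", "source": "support.md"},
]
previous_hashes = {
    "pricing.md": content_hash("Pricing: $8 per seat."),     # text has changed
    "support.md": content_hash("Support hours: 9 to 5."),    # unchanged
}
to_reindex = find_changed(current_docs, previous_hashes)
print([doc["source"] for doc in to_reindex])
```

One caveat: if a changed document now produces fewer chunks than before, upserting alone leaves the old trailing chunks behind, so those need to be deleted for that source as part of the re-index.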

9. RAG Pipeline in Production

The pipeline you built works for learning and prototyping. Moving it to production requires attention to three areas: scaling, evaluation, and monitoring.

Scaling

The in-memory ChromaDB client works for thousands of chunks. For millions, switch to a managed vector database:

Scale	Recommended Approach
<10K chunks	ChromaDB in-memory or persistent local
10K–1M chunks	ChromaDB persistent, Qdrant local, or FAISS
1M+ chunks	Pinecone, Qdrant Cloud, or Weaviate Cloud

Embedding generation is the slowest part of indexing. For large document sets, use batch embedding with the approach shown in Step 3 (100 chunks per API call). For very large corpora (>100K documents), consider local embedding models to eliminate API costs and rate limits.

Evaluation

RAG evaluation measures four dimensions. Without measurement, you cannot systematically improve quality.

Metric	What It Measures	How to Compute
Context Precision	Are the retrieved chunks relevant to the query?	LLM-as-judge scores each chunk’s relevance
Context Recall	Did retrieval find all relevant chunks?	Compare retrieved chunks against a labeled ground truth set
Faithfulness	Does the answer stay grounded in the retrieved context?	LLM-as-judge checks each claim against the provided chunks
Answer Relevancy	Does the answer address the original question?	LLM-as-judge scores answer-question alignment

Start with a test set of 20–50 (question, expected answer, source document) triples. Run your pipeline on each question. Score each metric. This gives you a baseline. Every change to chunking, retrieval, or prompting should improve at least one metric without degrading the others.
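For the two retrieval metrics, a baseline can be computed without an LLM judge if the test set labels which chunks are relevant to each question. The sketch below is that simplified, label-driven variant, not the LLM-as-judge scoring described in the table. The function names and the chunk IDs are hypothetical.

```python
def context_precision_recall(
    retrieved_ids: list[str],
    relevant_ids: list[str],
) -> tuple[float, float]:
    """Set-based precision and recall for one test question."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean(values: list[float]) -> float:
    """Average over the test set; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Each entry: (IDs the pipeline retrieved, IDs labeled as relevant)
test_cases = [
    (["a", "b", "c", "d"], ["a", "d", "e"]),
    (["f", "g"], ["f"]),
]
per_case = [context_precision_recall(r, g) for r, g in test_cases]
print("mean precision:", mean([p for p, _ in per_case]))
print("mean recall:   ", mean([r for _, r in per_case]))
```

Re-running this after each change to chunking or retrieval parameters shows whether the change helped or hurt against the same fixed test set.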

Monitoring

Production RAG systems need three types of monitoring:

Retrieval quality: log the query, retrieved chunks, and distances for every request. Flag queries where the top chunk’s cosine distance is above a threshold (e.g., >0.5, where lower distance means higher similarity). These indicate likely retrieval failures.
Latency: measure embedding time, search time, and generation time separately. Total latency should be under 3 seconds for a good user experience. If embedding dominates, batch or cache queries. If generation dominates, use a faster model or enable streaming.
Index freshness: track the timestamp of the most recently indexed document. If documents are updated daily but your index is a week old, users get stale answers.
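Per-stage latency can be captured with a small context manager. This is a sketch using only the standard library; `timed` and `over_budget` are hypothetical helpers, and the `time.sleep` calls stand in for the real embedding, search, and generation calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Add the wall-clock duration of the with-block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)


def over_budget(timings: dict[str, float], budget_seconds: float = 3.0) -> bool:
    """True when the summed stage latencies exceed the budget."""
    return sum(timings.values()) > budget_seconds


timings: dict[str, float] = {}
with timed("embedding", timings):
    time.sleep(0.01)   # stand-in for embedding the query
with timed("search", timings):
    time.sleep(0.01)   # stand-in for the vector search
with timed("generation", timings):
    time.sleep(0.01)   # stand-in for the LLM call

for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
print("over budget:", over_budget(timings))
```

Logging the three numbers separately is what makes the advice above actionable: it shows which stage to optimise first.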

You built a complete RAG pipeline in Python: six stages implemented as plain functions, no framework dependencies, and a working system that answers questions about your documents. The pipeline loads documents, chunks them with overlap, generates embeddings, stores vectors in ChromaDB, retrieves relevant chunks by similarity, and generates grounded answers.

Key takeaways:

RAG has two phases: offline indexing (load, chunk, embed, store) and online query (retrieve, generate)
Chunking parameters (size and overlap) have more impact on quality than most engineers expect
The system prompt is the primary control for grounding — it tells the LLM to use only the retrieved context
Hybrid search, reranking, and multi-query retrieval address the three most common retrieval failure modes
Evaluation requires a test set and four metrics: precision, recall, faithfulness, and relevancy

RAG Architecture and Production Patterns — Design concepts, chunking strategies, and production failure modes for RAG systems
Vector Database Comparison — Compare Pinecone, Qdrant, Weaviate, ChromaDB, and FAISS for your RAG pipeline
Fine-Tuning vs RAG — When to use RAG, when to fine-tune, and when to combine both approaches
Python for GenAI Engineers — Production Python patterns including Pydantic, error handling, and async code
Async Python for GenAI — Make your RAG pipeline async for parallel retrieval and streaming responses
GenAI Engineer Interview Questions — RAG implementation questions appear in mid and senior-level technical rounds

Last updated: March 2026. Code examples use OpenAI Python SDK v1.x, ChromaDB v0.5.x, and PyPDF2 v3.x. The pipeline architecture applies regardless of which specific libraries or models you choose.

Frequently Asked Questions

How do I choose the right chunk size for my RAG pipeline?

Start with 500-token chunks and 50-token overlap. Smaller chunks (200-300 tokens) improve retrieval precision but lose surrounding context. Larger chunks (800-1000 tokens) preserve context but dilute relevance signals. Test with your actual documents and queries — measure retrieval quality by checking whether the returned chunks contain the answer to your test questions.

Why does my RAG pipeline return irrelevant results?

The most common causes are: chunks that are too large (diluting the relevance signal), no overlap between chunks (splitting key information across boundaries), poor query formulation (the query phrasing does not match the document language), or insufficient top-k results. Try reducing chunk size, increasing overlap, rephrasing queries, or adding hybrid search with keyword matching alongside vector search.
