RAG Pipeline Tutorial — Build a Working RAG System in Python (2026)
This tutorial walks you through building a complete RAG pipeline in Python — from loading raw documents to generating grounded answers. Every code block runs. By the end, you will have a working system that answers questions about your own documents.
Prerequisites: Python 3.10+, an OpenAI API key, and basic familiarity with Python. No prior RAG experience required.
1. Why Build a RAG Pipeline from Scratch
Frameworks like LangChain and LlamaIndex abstract the RAG pipeline into a few function calls. That abstraction is useful in production — but it hides the mechanics. When retrieval quality is poor, when answers hallucinate, when latency is unacceptable, you need to understand what is happening at each stage to diagnose and fix the problem.
Building a RAG pipeline from scratch teaches you:
- What each component does — not conceptually, but in working code you can step through
- Where quality bottlenecks live — chunking decisions, embedding choices, retrieval parameters
- How to debug retrieval failures — why a relevant document was not retrieved, why an irrelevant one was
- What frameworks abstract away — so you know when the abstraction helps and when it hinders
This is not an exercise in reinventing the wheel. It is an exercise in understanding the wheel so you can build better vehicles. After completing this tutorial, framework-based RAG code will make significantly more sense because you will have implemented every layer yourself.
The entire pipeline fits in approximately 100 lines of Python. No framework magic. No hidden complexity.
2. What You Will Build
The RAG pipeline you build in this tutorial has six stages. Each stage is a discrete function you can inspect, modify, and test independently.
Pipeline overview:
1. Load Documents → Read PDF or text files into raw strings
2. Chunk Text → Split documents into overlapping segments (500 tokens, 50-token overlap)
3. Generate Embeddings → Convert each chunk into a vector using OpenAI's embedding model
4. Store in Vector DB → Index vectors in ChromaDB for fast similarity search
5. Retrieve Chunks → Given a user query, find the most relevant chunks
6. Generate Answer → Pass retrieved chunks + query to an LLM for a grounded response

What you will need:

```bash
pip install openai chromadb PyPDF2 tiktoken
```

- openai — embedding generation and LLM calls
- chromadb — local vector database (no server required)
- PyPDF2 — PDF document loading
- tiktoken — accurate token counting for chunk sizing
Set your API key as an environment variable before running any code:
```bash
export OPENAI_API_KEY="your-api-key-here"
```

The total cost to run this tutorial is under $0.05 in OpenAI API calls. Embedding a few documents and making a handful of LLM calls uses minimal tokens.
3. RAG Pipeline Architecture
Before writing code, understand the data flow. A RAG pipeline has two phases: an offline indexing phase (steps 1–4, run once per document set) and an online query phase (steps 5–6, run on every user question).
Data Flow Diagram
RAG Pipeline — End-to-End Data Flow
The indexing phase runs once per document set. The query phase runs on every user question. Both connect through the vector store.
The indexing phase transforms raw documents into searchable vectors. The query phase converts the user’s question into a vector, finds the closest document chunks, and passes them as context to the LLM. The LLM generates an answer grounded in the retrieved content — not from its training data alone.
4. Build the RAG Pipeline Step by Step
This is the core of the tutorial. Each step is a standalone function with a clear input and output. Together, they form the complete pipeline.
Step 1: Load Documents
Load text from PDF files or plain text files. This function handles both formats and returns a list of document strings.
```python
from pathlib import Path

from PyPDF2 import PdfReader

def load_documents(file_paths: list[str]) -> list[dict]:
    """Load documents from PDF or text files.

    Returns a list of dicts with 'text' and 'source' keys.
    """
    documents = []
    for path in file_paths:
        file_path = Path(path)
        if file_path.suffix == ".pdf":
            reader = PdfReader(str(file_path))
            text = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
        elif file_path.suffix in (".txt", ".md"):
            text = file_path.read_text(encoding="utf-8")
        else:
            print(f"Skipping unsupported file: {path}")
            continue

        if text.strip():
            documents.append({"text": text, "source": str(file_path)})

    return documents
```

Key decisions:

- Each document carries its source metadata — this enables citation in the final answer
- Empty pages are skipped (common in scanned PDFs with OCR failures)
- The function is format-agnostic — add more file types by extending the if/elif chain
Step 2: Chunk Text with Overlap
Chunking splits documents into segments small enough for embedding but large enough to preserve context. Overlap ensures no information is lost at chunk boundaries.
```python
import tiktoken

def chunk_text(
    text: str,
    source: str,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[dict]:
    """Split text into overlapping chunks based on token count.

    Returns a list of dicts with 'text', 'source', and 'chunk_index' keys.
    """
    encoder = tiktoken.encoding_for_model("gpt-4o-mini")
    tokens = encoder.encode(text)
    chunks = []
    start = 0

    while start < len(tokens):
        end = start + chunk_size
        # Decode into a local variable (do not shadow the function name)
        chunk_str = encoder.decode(tokens[start:end])

        chunks.append({
            "text": chunk_str,
            "source": source,
            "chunk_index": len(chunks),
        })

        # Move forward by chunk_size minus overlap
        start += chunk_size - overlap

    return chunks

def chunk_documents(documents: list[dict], **kwargs) -> list[dict]:
    """Chunk all documents and return a flat list of chunks."""
    all_chunks = []
    for doc in documents:
        doc_chunks = chunk_text(doc["text"], doc["source"], **kwargs)
        all_chunks.extend(doc_chunks)
    return all_chunks
```

Why 500 tokens with 50-token overlap?
- 500 tokens is large enough to contain a complete paragraph or concept, but small enough that the embedding vector captures a focused topic. Larger chunks dilute the relevance signal. Smaller chunks lose context.
- 50-token overlap ensures that a sentence split across two chunks appears in full in at least one of them. Without overlap, important information at chunk boundaries is effectively invisible to retrieval.
These defaults work well for most documents. Adjust them based on your specific content — shorter for FAQ-style documents, longer for dense technical writing.
Step 3: Generate Embeddings
Convert each text chunk into a numerical vector using OpenAI’s embedding model. These vectors capture semantic meaning — chunks about similar topics produce similar vectors.
```python
from openai import OpenAI

client = OpenAI()  # Uses the OPENAI_API_KEY env var

def generate_embeddings(
    chunks: list[dict],
    model: str = "text-embedding-3-small",
) -> list[dict]:
    """Add an 'embedding' key to each chunk dict.

    Sends chunks in batches to the OpenAI embedding API.
    """
    batch_size = 100  # API supports up to 2048 inputs per call
    texts = [chunk["text"] for chunk in chunks]

    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        response = client.embeddings.create(
            model=model,
            input=batch,
        )
        all_embeddings.extend(item.embedding for item in response.data)

    # Attach embeddings to chunk dicts
    for chunk, embedding in zip(chunks, all_embeddings):
        chunk["embedding"] = embedding

    return chunks
```

Why text-embedding-3-small?
- It produces 1536-dimensional vectors with strong semantic quality
- It costs $0.02 per million tokens — embedding an entire book costs pennies
- For higher quality at 2x the cost, use text-embedding-3-large (3072 dimensions)
- For zero cost, use a local model like all-MiniLM-L6-v2 via sentence-transformers (384 dimensions, runs on CPU)
Step 4: Store in ChromaDB
ChromaDB is an open-source vector database that runs locally — no server setup, no cloud account. It stores vectors alongside metadata and supports fast similarity search.
```python
import chromadb

def create_vector_store(
    chunks: list[dict],
    collection_name: str = "rag_tutorial",
) -> chromadb.Collection:
    """Store chunks with embeddings in a ChromaDB collection.

    Returns the collection for later querying.
    """
    chroma_client = chromadb.Client()  # In-memory — use PersistentClient for disk

    # Delete existing collection if it exists (clean re-index)
    try:
        chroma_client.delete_collection(collection_name)
    except ValueError:
        pass

    collection = chroma_client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # Cosine similarity
    )

    collection.add(
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        embeddings=[chunk["embedding"] for chunk in chunks],
        documents=[chunk["text"] for chunk in chunks],
        metadatas=[
            {"source": chunk["source"], "chunk_index": chunk["chunk_index"]}
            for chunk in chunks
        ],
    )

    return collection
```

Key configuration:

- hnsw:space: "cosine" — cosine similarity is the standard distance metric for OpenAI embeddings. Documents with similar meaning produce vectors that point in similar directions.
- chromadb.Client() — in-memory storage. For persistence across sessions, use chromadb.PersistentClient(path="./chroma_db").
- Metadata is stored alongside vectors — this enables filtering by source, date, or any custom field during retrieval.
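To make "point in similar directions" concrete, here is a minimal cosine-similarity computation in plain Python. This is illustration only: ChromaDB computes this internally, and what it reports is cosine distance, which is 1 minus this similarity value.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1; orthogonal vectors score near 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ≈ 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # ≈ 0.0 (orthogonal)
```

Note that magnitude is ignored: [1, 2] and [2, 4] score as identical because only direction matters, which is why cosine is a good fit for embeddings of different-length texts.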
Step 5: Retrieve Relevant Chunks
Given a user query, embed it with the same model and find the closest chunks by cosine similarity.
```python
def retrieve(
    query: str,
    collection: chromadb.Collection,
    n_results: int = 5,
) -> list[dict]:
    """Retrieve the most relevant chunks for a query.

    Returns chunks ranked by cosine similarity to the query embedding.
    """
    # Embed the query using the same model as the documents
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    query_embedding = query_response.data[0].embedding

    # Search the vector store
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )

    # Format results
    retrieved_chunks = []
    for i in range(len(results["documents"][0])):
        retrieved_chunks.append({
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i],
        })

    return retrieved_chunks
```

Why n_results=5?
- Fewer results (1–2) risk missing relevant context. More results (10+) add noise that can confuse the LLM.
- 5 is a practical default. After building the pipeline, tune this number based on your specific documents and queries.
- The distance field tells you how similar each chunk is — useful for debugging and setting relevance thresholds.
Step 6: Generate an Answer with Context
Assemble the retrieved chunks into a prompt and send it to the LLM. The system prompt instructs the model to answer only from the provided context.
```python
def generate_answer(
    query: str,
    retrieved_chunks: list[dict],
    model: str = "gpt-4o-mini",
) -> str:
    """Generate an answer grounded in the retrieved context.

    The LLM is instructed to use only the provided context.
    """
    # Build context from retrieved chunks
    context_parts = []
    for i, chunk in enumerate(retrieved_chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['source']}]\n{chunk['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    system_prompt = (
        "You are a helpful assistant that answers questions based on "
        "the provided context. Use only the information in the context "
        "to answer. If the context does not contain enough information "
        "to answer the question, say so explicitly. Cite the source "
        "number (e.g., [Source 1]) when using information from a "
        "specific passage."
    )

    user_prompt = (
        f"Context:\n{context}\n\n---\n\n"
        f"Question: {query}\n\n"
        f"Answer based on the context above:"
    )

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )

    return response.choices[0].message.content
```

Key prompt engineering decisions:

- temperature=0 — deterministic output. For RAG, you want the LLM to faithfully report what is in the context, not creatively interpret it.
- Source citation instruction — the LLM is told to cite [Source N], linking answers back to specific retrieved chunks. This is basic grounding.
- Explicit refusal instruction — “if the context does not contain enough information, say so” prevents the LLM from hallucinating when retrieval fails.
Putting It All Together
Here is the complete pipeline in a single runnable script:
```python
def run_rag_pipeline(
    file_paths: list[str],
    query: str,
) -> str:
    """Run the complete RAG pipeline end to end."""
    # Phase 1: Indexing
    print("Loading documents...")
    documents = load_documents(file_paths)
    print(f"Loaded {len(documents)} documents")

    print("Chunking...")
    chunks = chunk_documents(documents)
    print(f"Created {len(chunks)} chunks")

    print("Generating embeddings...")
    chunks = generate_embeddings(chunks)
    print("Embeddings generated")

    print("Storing in vector database...")
    collection = create_vector_store(chunks)
    print(f"Stored {collection.count()} chunks in ChromaDB")

    # Phase 2: Query
    print(f"\nQuery: {query}")
    print("Retrieving relevant chunks...")
    retrieved = retrieve(query, collection)
    print(f"Retrieved {len(retrieved)} chunks")

    print("Generating answer...")
    answer = generate_answer(query, retrieved)

    return answer

# Run it
if __name__ == "__main__":
    answer = run_rag_pipeline(
        file_paths=["your-document.pdf"],
        query="What are the main points of this document?",
    )
    print(f"\nAnswer:\n{answer}")
```

Run this against any PDF or text file on your machine. Replace the file path and query with your own. The output includes the generated answer with source citations.
5. RAG Component Stack
Each layer of the RAG pipeline has a specific responsibility. Understanding this stack helps you swap components — replace ChromaDB with Pinecone, replace OpenAI embeddings with a local model, or add a reranking layer.
RAG Pipeline Component Stack
Each layer can be swapped independently. The interfaces between layers stay the same.
The stack diagram shows why RAG is modular. You can upgrade any layer without rewriting the others. Want better retrieval? Add a reranking layer between retrieval and prompt assembly. Want lower cost? Swap OpenAI embeddings for a local model. Want persistence? Replace the in-memory ChromaDB client with a persistent one.
6. RAG Pipeline Enhancements
The basic pipeline works, but production RAG systems add three enhancements that measurably improve answer quality.
Enhancement 1: Hybrid Search
Vector search finds semantically similar content but misses exact keyword matches: product names, error codes, version numbers. Hybrid search combines vector similarity with keyword matching.
Approach: Run vector search to get candidates, then re-score by combining cosine similarity (70% weight) with keyword overlap (30% weight). ChromaDB supports where_document filters for basic keyword matching alongside vector search. For full BM25 keyword scoring, add Elasticsearch, or use Weaviate, which supports both natively.
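A minimal sketch of that weighted re-scoring, assuming the chunk dicts returned by the retrieve() function above. The keyword_overlap helper here is a deliberately crude stand-in for real BM25 scoring:

```python
def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of query words that appear in the chunk (crude BM25 stand-in)."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)

def hybrid_rescore(query: str, candidates: list[dict]) -> list[dict]:
    """Re-rank vector-search candidates by a weighted hybrid score.

    Each candidate dict needs 'text' and 'distance' (cosine distance)
    keys, matching the output of the retrieve() function above.
    """
    for c in candidates:
        vector_similarity = 1.0 - c["distance"]  # cosine distance -> similarity
        c["score"] = 0.7 * vector_similarity + 0.3 * keyword_overlap(query, c["text"])
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

# A chunk containing the exact error code can outrank a semantically
# "closer" but keyword-free chunk after re-scoring.
candidates = [
    {"text": "error code E-404 in the gateway", "distance": 0.4},
    {"text": "general networking overview", "distance": 0.2},
]
ranked = hybrid_rescore("what is error code E-404", candidates)
print(ranked[0]["text"])  # the chunk with the exact error code ranks first
```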
```python
# Core idea — combine scores from both retrieval methods
combined_score = 0.7 * vector_similarity + 0.3 * keyword_overlap
```

Enhancement 2: Reranking
Initial retrieval ranks chunks by embedding similarity — a rough approximation. A cross-encoder reranker takes each (query, chunk) pair and produces a precise relevance score. It is more accurate but slower, so it runs only on the top 10–20 candidates from the initial retrieval.
Approach: Retrieve a larger set (e.g., 20 chunks), then rerank with a cross-encoder model like cross-encoder/ms-marco-MiniLM-L-6-v2 from sentence-transformers. The reranker processes each (query, chunk) pair and outputs a relevance score. Keep the top 3–5 chunks for the LLM. For a quick prototype, use the LLM itself as a reranker by asking it to rate relevance on a 1–10 scale — but this is slow and expensive at scale.
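A sketch of that rerank step with a pluggable scoring function. The word_overlap_score stand-in below is purely illustrative so the example runs without model downloads; in practice you would wrap the predict method of a sentence-transformers CrossEncoder as the score_fn:

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[dict],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> list[dict]:
    """Score each (query, chunk) pair and keep the top_k chunks.

    score_fn stands in for a cross-encoder; with sentence-transformers you
    would wrap CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict
    so it scores one (query, text) pair at a time.
    """
    scored = [
        {**c, "rerank_score": score_fn(query, c["text"])} for c in candidates
    ]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_k]

# Toy stand-in scorer: shared-word count. A real cross-encoder is far
# more accurate because it reads query and chunk together.
def word_overlap_score(query: str, text: str) -> float:
    return len(set(query.lower().split()) & set(text.lower().split()))

top = rerank("why do cats sleep", [
    {"text": "cats sleep a lot"},
    {"text": "dogs bark at night"},
], word_overlap_score, top_k=1)
print(top[0]["text"])  # the cat chunk wins
```

The design point is the separation: retrieval stays cheap and broad, while the expensive scorer only sees the shortlist.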
Enhancement 3: Multi-Query Retrieval
A single query phrasing may miss relevant chunks when the user’s vocabulary differs from the document’s language. Multi-query retrieval generates 3–5 alternative phrasings of the question (using the LLM), retrieves against each, and merges results by deduplicating on chunk text.
Approach: Send the original query to the LLM with a prompt like “Generate 3 alternative phrasings of this question.” Retrieve top-3 for each variant. Deduplicate by chunk content. This broadens recall without sacrificing precision.
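The merge-and-deduplicate step can be sketched as a small pure-Python function, assuming the chunk dicts produced by retrieve() above (the LLM call that generates the query variants is omitted here):

```python
def merge_results(result_lists: list[list[dict]]) -> list[dict]:
    """Merge retrieval results from several query variants.

    Deduplicates on chunk text, keeping the best (smallest) cosine
    distance seen for each chunk, then sorts by that distance.
    """
    best: dict[str, dict] = {}
    for results in result_lists:
        for chunk in results:
            key = chunk["text"]
            if key not in best or chunk["distance"] < best[key]["distance"]:
                best[key] = chunk
    return sorted(best.values(), key=lambda c: c["distance"])
```

Keying on chunk text (rather than chunk ID) also deduplicates identical chunks that were indexed from different sources, which is usually what you want here.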
These three enhancements address the three most common RAG failure modes: keyword mismatch, imprecise ranking, and query-document vocabulary gap. Apply them incrementally based on where your pipeline underperforms.
7. Framework RAG vs Custom RAG
The pipeline you built in this tutorial is custom RAG — every component is explicit code you control. Framework RAG (LangChain, LlamaIndex) abstracts these components behind high-level APIs. Both approaches have clear trade-offs.
Framework RAG vs Custom RAG

Framework RAG — advantages:

- Build a working RAG pipeline in 10 lines of code
- Pre-built integrations for 50+ vector stores and LLMs
- Community recipes for common patterns like hybrid search

Framework RAG — drawbacks:

- Abstraction hides retrieval details — hard to debug quality issues
- Framework updates can break your pipeline silently
- Performance overhead from abstraction layers adds latency
- Lock-in to framework-specific patterns and data structures

Custom RAG — advantages:

- Full visibility into every step — easy to debug retrieval failures
- No framework overhead — minimal latency from your code
- Swap any component without framework migration
- Deep understanding of RAG internals for interviews and design

Custom RAG — drawbacks:

- More code to write and maintain for each integration
- You build common patterns (retry, batching) yourself
- No community ecosystem for pre-built retrievers or chains
The recommendation: start custom (which you just did), then adopt a framework when the integration benefits outweigh the abstraction costs. Many production systems are hybrid — framework orchestration with custom retrieval logic.
8. Interview Questions
RAG pipeline implementation is one of the most frequently tested topics in GenAI engineering interviews. These questions test whether you can build, not just describe.
Q1: Walk me through building a RAG pipeline from scratch.
What interviewers want: A clear, ordered explanation of the six stages — load, chunk, embed, store, retrieve, generate — with specific technology choices and justification for each.
Strong answer structure: Start with the two-phase architecture (offline indexing, online retrieval). For each stage, name the specific tool (PyPDF2, tiktoken, text-embedding-3-small, ChromaDB), explain the key configuration choice (chunk size 500, overlap 50, cosine similarity, top-5 retrieval, temperature 0), and state why that choice matters. End with the system prompt design — how you instruct the LLM to stay grounded and cite sources.
Q2: Your RAG pipeline retrieves chunks but the answer is wrong. How do you debug it?
What interviewers want: A systematic debugging approach, not guesswork.
Strong answer: Inspect the retrieved chunks first. If the relevant information is not in the retrieved chunks, the problem is retrieval — adjust chunk size, overlap, or embedding model. If the relevant information is in the chunks but the answer ignores it, the problem is generation — adjust the system prompt, reduce the number of chunks (noise), or lower temperature. If the information is partially in multiple chunks that were not all retrieved, the problem is fragmentation — increase overlap or switch to hierarchical chunking.
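The first triage step above can be captured in a small helper. The diagnose function is hypothetical and assumes you know a key phrase the correct answer must contain:

```python
def diagnose(retrieved_chunks: list[dict], expected_phrase: str) -> str:
    """First triage step: is the answer's key phrase in the retrieved context?

    Returns "generation" if the phrase was retrieved (so the LLM ignored or
    mangled it) or "retrieval" if it never made it into the context.
    """
    for chunk in retrieved_chunks:
        if expected_phrase.lower() in chunk["text"].lower():
            return "generation"  # context was fine; inspect the prompt/model
    return "retrieval"  # fix chunking, overlap, or the embedding model
```

Running this over a handful of failing questions quickly tells you which half of the pipeline to work on first.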
Q3: How would you add hybrid search to this pipeline?
What interviewers want: Understanding that vector search alone has failure modes, and practical knowledge of how to combine semantic and keyword matching.
Strong answer: Vector search excels at semantic matching but misses exact terms — product names, error codes, acronyms. Add BM25 or keyword filtering alongside vector search. Merge results using reciprocal rank fusion: for each document, combine its rank from both methods. This captures both semantic relevance and keyword precision. ChromaDB supports basic keyword filtering via where_document; for full BM25, add Elasticsearch or use a database that supports both natively like Weaviate.
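Reciprocal rank fusion itself is only a few lines. A sketch over ranked lists of chunk IDs, one list per retrieval method, best first (k=60 is the constant conventionally used with RRF):

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    A chunk's fused score is the sum of 1 / (k + rank) over every list
    it appears in (rank is 1-based), so items ranked highly by multiple
    methods rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the vector and keyword retrievers, which is why it is the usual default for hybrid merging.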
Q4: What happens if your documents change frequently? How do you keep the RAG pipeline current?
What interviewers want: Awareness of the operational challenge of maintaining a RAG system beyond the initial build.
Strong answer: Track document versions with content hashes. When a document changes, re-chunk and re-embed only the changed document — do not re-index the entire corpus. Use ChromaDB’s upsert to replace stale chunks. For large-scale systems, build an incremental indexing pipeline that watches for file changes (or database update timestamps) and processes only deltas. Set up monitoring to track embedding freshness — the gap between the most recent document update and the most recent re-index.
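The content-hash bookkeeping can be sketched with hashlib. The changed_documents helper is illustrative, reusing the {'text', 'source'} document shape from load_documents above:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_documents(
    documents: list[dict], known_hashes: dict[str, str]
) -> list[dict]:
    """Return only the documents whose content changed since the last index.

    known_hashes maps source path -> hash recorded at last index time and
    is updated in place; re-chunk and re-embed only the returned documents.
    """
    stale = []
    for doc in documents:
        h = content_hash(doc["text"])
        if known_hashes.get(doc["source"]) != h:
            stale.append(doc)
            known_hashes[doc["source"]] = h
    return stale
```

In a real pipeline you would persist known_hashes (a JSON file or a table) and pass the stale documents through chunking, embedding, and an upsert into the vector store.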
9. RAG Pipeline in Production
The pipeline you built works for learning and prototyping. Moving it to production requires attention to three areas: scaling, evaluation, and monitoring.
Scaling
The in-memory ChromaDB client works for thousands of chunks. For millions, switch to a managed vector database:
| Scale | Recommended Approach |
|---|---|
| <10K chunks | ChromaDB in-memory or persistent local |
| 10K–1M chunks | ChromaDB persistent, Qdrant local, or FAISS |
| 1M+ chunks | Pinecone, Qdrant Cloud, or Weaviate Cloud |
Embedding generation is the slowest part of indexing. For large document sets, use batch embedding with the approach shown in Step 3 (100 chunks per API call). For very large corpora (>100K documents), consider local embedding models to eliminate API costs and rate limits.
Evaluation
RAG evaluation measures four dimensions. Without measurement, you cannot systematically improve quality.
| Metric | What It Measures | How to Compute |
|---|---|---|
| Context Precision | Are the retrieved chunks relevant to the query? | LLM-as-judge scores each chunk’s relevance |
| Context Recall | Did retrieval find all relevant chunks? | Compare retrieved chunks against a labeled ground truth set |
| Faithfulness | Does the answer stay grounded in the retrieved context? | LLM-as-judge checks each claim against the provided chunks |
| Answer Relevancy | Does the answer address the original question? | LLM-as-judge scores answer-question alignment |
Start with a test set of 20–50 (question, expected answer, source document) triples. Run your pipeline on each question. Score each metric. This gives you a baseline. Every change to chunking, retrieval, or prompting should improve at least one metric without degrading the others.
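A minimal harness for the retrieval half of this evaluation might look like the following. The substring check is a crude context-recall proxy; LLM-as-judge scoring would replace it in a fuller setup. The evaluate_retrieval name and test-case shape are assumptions, not a standard API:

```python
def evaluate_retrieval(test_set: list[dict], retrieve_fn) -> float:
    """Crude context-recall proxy over a labeled test set.

    Each test case is {'question': str, 'expected': str}; retrieve_fn maps
    a question to a list of chunk dicts. Returns the fraction of questions
    whose expected answer text appears in at least one retrieved chunk.
    """
    hits = 0
    for case in test_set:
        chunks = retrieve_fn(case["question"])
        if any(case["expected"].lower() in c["text"].lower() for c in chunks):
            hits += 1
    return hits / len(test_set) if test_set else 0.0
```

Run this before and after every chunking or retrieval change; a drop in the score flags a regression even before you look at generated answers.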
Monitoring
Production RAG systems need three types of monitoring:
- Retrieval quality — log the query, retrieved chunks, and distances for every request. Flag queries where the top chunk’s similarity score is below a threshold (e.g., cosine distance >0.5) — these indicate retrieval failures.
- Latency — measure embedding time, search time, and generation time separately. Total latency should be under 3 seconds for a good user experience. If embedding dominates, batch or cache queries. If generation dominates, use a faster model or enable streaming.
- Index freshness — track the timestamp of the most recently indexed document. If documents are updated daily but your index is a week old, users get stale answers.
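The retrieval-quality logging in the first bullet can be sketched as a log-record builder (log_retrieval and its threshold default are illustrative; wire it to your logging backend of choice):

```python
import time

def log_retrieval(
    query: str,
    retrieved_chunks: list[dict],
    distance_threshold: float = 0.5,
) -> dict:
    """Build a log record for one retrieval and flag likely failures.

    Flags the request when even the best chunk's cosine distance exceeds
    the threshold, a sign the index holds nothing relevant for this query.
    """
    best_distance = min(
        (c["distance"] for c in retrieved_chunks), default=float("inf")
    )
    return {
        "timestamp": time.time(),
        "query": query,
        "n_chunks": len(retrieved_chunks),
        "best_distance": best_distance,
        "flagged": best_distance > distance_threshold,
    }
```

Reviewing the flagged queries weekly tells you which topics your corpus is missing, which is often more actionable than any aggregate metric.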
10. Summary and Related Resources
You built a complete RAG pipeline in Python: six functions, no framework dependencies, and a working system that answers questions about your documents. The pipeline loads documents, chunks them with overlap, generates embeddings, stores vectors in ChromaDB, retrieves relevant chunks by similarity, and generates grounded answers.
Key takeaways:
- RAG has two phases: offline indexing (load, chunk, embed, store) and online query (retrieve, generate)
- Chunking parameters (size and overlap) have more impact on quality than most engineers expect
- The system prompt is the primary control for grounding — it tells the LLM to use only the retrieved context
- Hybrid search, reranking, and multi-query retrieval address the three most common retrieval failure modes
- Evaluation requires a test set and four metrics: precision, recall, faithfulness, and relevancy
Related
- RAG Architecture and Production Patterns — Design concepts, chunking strategies, and production failure modes for RAG systems
- Vector Database Comparison — Compare Pinecone, Qdrant, Weaviate, ChromaDB, and FAISS for your RAG pipeline
- Fine-Tuning vs RAG — When to use RAG, when to fine-tune, and when to combine both approaches
- Python for GenAI Engineers — Production Python patterns including Pydantic, error handling, and async code
- Async Python for GenAI — Make your RAG pipeline async for parallel retrieval and streaming responses
- GenAI Engineer Interview Questions — RAG implementation questions appear in mid and senior-level technical rounds
Last updated: March 2026. Code examples use OpenAI Python SDK v1.x, ChromaDB v0.5.x, and PyPDF2 v3.x. The pipeline architecture applies regardless of which specific libraries or models you choose.
Frequently Asked Questions
What libraries do I need to build a RAG pipeline in Python?
A minimal RAG pipeline needs four libraries: openai for embeddings and LLM generation, chromadb for vector storage, PyPDF2 for document loading, and tiktoken for token counting. Install with pip install openai chromadb PyPDF2 tiktoken. For production systems, you may add LangChain for orchestration or sentence-transformers for local embedding models.
How do I choose the right chunk size for my RAG pipeline?
Start with 500-token chunks and 50-token overlap. Smaller chunks (200-300 tokens) improve retrieval precision but lose surrounding context. Larger chunks (800-1000 tokens) preserve context but dilute relevance signals. Test with your actual documents and queries — measure retrieval quality by checking whether the returned chunks contain the answer to your test questions.
Why does my RAG pipeline return irrelevant results?
The most common causes are: chunks that are too large (diluting the relevance signal), no overlap between chunks (splitting key information across boundaries), poor query formulation (the query phrasing does not match the document language), or insufficient top-k results. Try reducing chunk size, increasing overlap, rephrasing queries, or adding hybrid search with keyword matching alongside vector search.
Can I build a RAG pipeline without OpenAI?
Yes. Replace OpenAI embeddings with sentence-transformers running locally (e.g., all-MiniLM-L6-v2) and replace the OpenAI LLM with Ollama running a local model like Llama 3 or Mistral. ChromaDB works with any embedding model. The pipeline architecture stays identical — only the embedding and generation functions change.
How is this tutorial different from the RAG architecture guide?
The RAG architecture guide covers design concepts: what RAG is, architectural patterns, chunking strategies, hybrid search theory, and production failure modes. This tutorial is a hands-on Python implementation. You write and run every line of code: loading documents, chunking text, generating embeddings, storing vectors in ChromaDB, retrieving relevant chunks, and generating answers. Start here to build, then read the architecture guide to design production systems.