RAG Chunking Strategies — Semantic, Recursive & Agentic Chunking (2026)

Chunking is the single most underrated decision in any RAG system. Engineers spend days tuning embedding models and vector databases while leaving their chunking strategy at default — and wonder why retrieval quality is mediocre. This guide covers every chunking strategy used in production in 2026, with Python code and clear guidance on when to use each.

This guide is written for engineers who have a working RAG pipeline and want to understand chunking deeply — not just how to call a splitter, but why the strategy choice matters and how to benchmark it.

You will get the most from this guide if you:

  • Have built or are actively building a RAG pipeline
  • Understand what embeddings are and how vector similarity search works
  • Have chosen or are evaluating a vector database for your system
  • Want to move beyond “just use 512 tokens with 10% overlap” defaults
  • Are preparing for GenAI engineer interviews where chunking is a standard topic

If you are new to RAG, start with the RAG architecture guide first, then return here.


2. Why Chunking Matters More Than You Think


Chunking is not a preprocessing detail. It is an architectural decision that determines the upper bound of your retrieval quality. No amount of embedding model tuning or vector search optimization can recover information that was shredded or buried by a poor chunking strategy.

Every chunking decision navigates a fundamental trade-off between two failure modes:

Chunks too large: The embedding captures an average of many topics. When you search for a specific detail, the similarity score gets diluted by unrelated content in the same chunk. The top-K retrieval returns chunks that are only partially relevant. Your LLM gets flooded with irrelevant context and either ignores it or hallucinates around it.

Chunks too small: Each chunk lacks the surrounding context needed to make sense of the information it contains. A sentence like “It supports up to 1 million rows” is meaningless without knowing what “it” refers to. Your LLM gets precise matches but cannot generate a coherent answer because each chunk is an orphaned fragment.

The ideal chunk is the smallest unit of text that is fully self-contained and answers a single, coherent question. That ideal differs by document type, query pattern, and embedding model — which is why no single default setting works for every system.

Benchmark comparisons of chunking strategies commonly report results along these lines (exact numbers vary by corpus and embedding model):

  • Moving from fixed-size to recursive splitting improves retrieval recall by 5-12% on structured documents (code docs, API references, technical manuals)
  • Moving from fixed-size to semantic chunking improves retrieval accuracy by 10-20% on narrative documents (blog posts, research papers, support articles)
  • Adding 10-20% token overlap to any fixed-size strategy recovers 60-70% of the boundary-loss problem at minimal storage cost
  • Hierarchical chunking (small chunks for retrieval, parent chunks for context) outperforms flat chunking on complex multi-hop questions by 15-25%

These improvements compound. A system using semantic chunking + hierarchical retrieval can outperform a flat fixed-size system by 30-40% on retrieval-based evaluation metrics — with the same embedding model and vector database.


3. Fixed-Size Chunking

Fixed-size chunking is the simplest strategy: split the document every N characters or tokens, optionally with an overlap window. It requires no NLP dependencies, runs in milliseconds, and is deterministic. Every engineer starts here.

# Requires: langchain-text-splitters>=0.3.0
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",     # Try to split on double newlines first
    chunk_size=1000,      # Characters per chunk
    chunk_overlap=200,    # Characters of overlap between chunks
    length_function=len,  # Measure by character count
)

with open("document.txt") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
print(f"First chunk preview: {chunks[0][:200]}")

Character-based splitting is fast but brittle. It will split mid-sentence whenever the character budget runs out, regardless of whether the separator appears. It works acceptably for clean, uniformly structured plain text. It fails badly on mixed-format documents.

Embedding models operate on tokens, not characters. A character count does not accurately predict the number of tokens a chunk will consume. For production systems, measure chunk size in tokens to avoid accidentally exceeding your embedding model’s maximum input length (typically 512 or 8,192 tokens depending on the model).

# Requires: tiktoken>=0.7.0, langchain-text-splitters>=0.3.0
import tiktoken
from langchain_text_splitters import TokenTextSplitter

# Use the same tokenizer as your embedding model
encoding = tiktoken.get_encoding("cl100k_base")  # OpenAI models

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=512,    # Tokens per chunk
    chunk_overlap=64,  # Token overlap (~12%)
)

chunks = splitter.split_text(text)

# Verify token counts
for i, chunk in enumerate(chunks[:3]):
    token_count = len(encoding.encode(chunk))
    print(f"Chunk {i}: {token_count} tokens, {len(chunk)} chars")

The simplest upgrade from pure fixed-size chunking: split on sentence boundaries first, then group sentences until you approach your token budget. This avoids mid-sentence breaks without requiring embedding-based boundary detection.

# Requires: spacy>=3.7.0
import spacy

# spaCy for sentence detection
nlp = spacy.load("en_core_web_sm")

def sentence_aware_chunks(text: str, max_tokens: int = 512, overlap_sentences: int = 2) -> list[str]:
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    chunks = []
    current_chunk = []
    current_tokens = 0
    for sentence in sentences:
        sentence_tokens = len(sentence.split())  # rough token estimate; use tiktoken for accuracy
        if current_tokens + sentence_tokens > max_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            # Keep last N sentences as overlap
            current_chunk = current_chunk[-overlap_sentences:]
            current_tokens = sum(len(s.split()) for s in current_chunk)
        current_chunk.append(sentence)
        current_tokens += sentence_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

When to use fixed-size chunking: Prototypes, uniform plain-text documents, when you need deterministic chunk sizes, or when indexing speed matters more than retrieval quality.

When to upgrade: When you see retrieval returning chunks that cut off mid-thought, or when your evaluation metrics show context relevance below 0.7.


4. Recursive Character Splitting

Recursive character splitting is the default strategy in LangChain and the most widely used approach in production RAG systems. It improves on plain character splitting by trying a hierarchy of separators in order, falling back to the next when the preferred separator is not present.

The splitter maintains a priority list of separators: ["\n\n", "\n", ". ", " ", ""]. It first tries to split on double newlines (paragraph breaks). If a resulting piece is still too large, it tries single newlines. Then sentence-ending periods. Then spaces. The raw character split is a last resort.

This means the splitter naturally respects the document’s own structure — it splits between paragraphs before it splits between sentences, and between sentences before it splits mid-word.

# Requires: langchain-text-splitters>=0.3.0
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=[
        "\n\n",  # Paragraph breaks (first preference)
        "\n",    # Line breaks
        ". ",    # Sentence endings
        "! ",
        "? ",
        " ",     # Word boundaries
        "",      # Characters (last resort)
    ],
)

# documents: a list[Document] produced by your loader
chunks = splitter.split_documents(documents)

For structured content formats, customize the separator hierarchy to match the format:

# Requires: langchain-text-splitters>=0.3.0
# For Markdown documents
from langchain_text_splitters import MarkdownTextSplitter

md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)

# For Python code
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,  # Code chunks can be larger — denser information
    chunk_overlap=200,
)

# For HTML
html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML,
    chunk_size=1000,
    chunk_overlap=100,
)

The language-aware variants use syntax-aware separators. For Python, it splits on class and function definitions before falling back to newlines and then characters. This keeps functions and class methods intact as single chunks — critical for code Q&A systems where splitting a function mid-body makes the chunk useless.

Raw text chunks without provenance are difficult to trace back to their source. Always attach metadata at chunking time:

# Requires: langchain-core>=0.3.0, langchain-text-splitters>=0.3.0
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)

def chunk_with_metadata(file_path: str, source_url: str) -> list[Document]:
    with open(file_path) as f:
        raw_text = f.read()
    base_doc = Document(
        page_content=raw_text,
        metadata={
            "source": source_url,
            "file_path": file_path,
        },
    )
    chunks = splitter.split_documents([base_doc])
    # Enrich each chunk with its position in the source document
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_index"] = i
        chunk.metadata["total_chunks"] = len(chunks)
    return chunks

When to use recursive splitting: The default choice for most production systems. It handles mixed-format documents gracefully, respects natural document structure, and requires no additional dependencies. Use it until you have evidence from evaluation metrics that a more sophisticated strategy is worth the added complexity.


5. The Strategy Progression

The progression below shows how chunking approaches build on each other in complexity and quality. Most production teams start at Fixed-Size, move to Recursive when they need better structure awareness, then add Semantic splitting when evaluation data shows retrieval gaps.

RAG Chunking Strategy Progression

Each stage adds quality at the cost of complexity and compute. Upgrade based on retrieval evaluation metrics, not instinct.

  • Fixed-Size: split every N tokens against a character or token budget. Simple, fast, deterministic, with no NLP dependencies; an optional overlap window recovers boundary loss. Ships in minutes. Use for: prototypes and uniform plain-text corpora.
  • Recursive: respect document structure via a separator hierarchy (\n\n → \n → . → space), with format-aware variants for Markdown, Python, and HTML. Preserves paragraphs; a low-cost upgrade. Use for: most production systems and mixed-format documents.
  • Semantic: split on topic boundaries detected by embedding similarity. Variable chunk sizes adapt to content density, with 10-20% better recall over a fixed-size baseline. Use for: narrative content such as research papers and support docs.
  • Agentic: LLM-decided, content-aware boundaries; the LLM reads each section. Best boundary quality and highest retrieval precision, but expensive indexing (an LLM call per document). Use for: high-value documents in static, curated corpora.

6. Semantic Chunking

Semantic chunking replaces rule-based splits with embedding-based boundary detection. Instead of splitting every N tokens, it splits when the meaning of the text changes — when the embedding similarity between adjacent sentences drops below a threshold.

The algorithm embeds each sentence independently. It then computes the cosine similarity between each consecutive sentence pair. When similarity drops sharply — a “semantic break point” — the algorithm creates a chunk boundary. Sentences on either side of the break belong to different topics and should not be in the same chunk.
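The break-point algorithm itself fits in a few lines. The sketch below substitutes a toy bag-of-words vector for a real embedding model so it runs with no API calls; `toy_embed` is our stand-in, not part of any library, and you would swap in real sentence embeddings for production.

```python
import math
import re

def toy_embed(sentence: str) -> dict[str, int]:
    # Stand-in for a real embedding model: a word-count vector
    vec: dict[str, int] = {}
    for word in re.findall(r"[a-z']+", sentence.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks: list[str] = []
    current = [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:  # semantic break point
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The cat sat on the mat.",
    "The cat licked the mat.",
    "Interest rates rose sharply in March.",
]
print(semantic_chunks(sentences))
# The two cat sentences stay together; the finance sentence starts a new chunk.
```

With a real model, you would batch the embedding calls and tune the threshold empirically, which is exactly what the percentile and standard-deviation settings in LangChain's SemanticChunker automate.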

# Requires: langchain-experimental>=0.3.0, langchain-openai>=0.2.0
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Percentile threshold: break wherever the sentence-to-sentence distance
# exceeds the 85th percentile of all distances
chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,  # Higher = fewer, larger chunks
)

with open("technical_doc.txt") as f:
    text = f.read()

chunks = chunker.split_text(text)
print(f"Created {len(chunks)} semantic chunks")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(chunk[:300])

LangChain’s SemanticChunker supports three threshold types:

| Threshold Type | Behavior | Best For |
| --- | --- | --- |
| percentile | Split at sentences where similarity is below the Nth percentile | Documents with consistent topic density |
| standard_deviation | Split when similarity drops more than N standard deviations from the mean | Documents with variable topic density |
| interquartile | Split based on the IQR of similarity scores | Robust to outliers, good general default |
# Requires: langchain-experimental>=0.3.0, langchain-openai>=0.2.0
# (embeddings is the OpenAIEmbeddings instance from the previous example)

# Standard deviation approach — good for technical docs with dense information
chunker_std = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.5,  # Split where the drop exceeds 1.5 std devs from the mean
)

# Interquartile approach — robust general default
chunker_iqr = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5,
)

Hierarchical chunking: combine small and large


Semantic chunking pairs naturally with a hierarchical retrieval pattern. Index small chunks (precise retrieval) but return the parent chunk (rich context) to the LLM. This is the most effective approach for complex question answering.

# Requires: langchain>=0.3.0, langchain-experimental>=0.3.0, langchain-openai>=0.2.0, langchain-community>=0.3.0
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Child splitter: small chunks for precise retrieval
child_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,
)

# Parent splitter: larger chunks for context delivery to LLM
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)

# Vector store indexes child chunks
vectorstore = Chroma(embedding_function=embeddings)
# Document store holds parent chunks
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents — both child and parent chunks are created automatically
# (documents: a list[Document] produced by your loader)
retriever.add_documents(documents)

# Retrieval returns parent chunks even though child chunks matched
results = retriever.invoke("What are the authentication requirements?")

When to use semantic chunking: Narrative documents (blog posts, research papers, support articles, PDFs), when your embedding-based evaluation shows low context relevance scores, or when document topics shift frequently within a single page.

Cost consideration: Semantic chunking requires one embedding call per sentence during the chunking phase — this adds cost at indexing time but not at query time. For a 100,000-word corpus (roughly 5,000-7,000 sentences), expect a few thousand embedding calls (far fewer if batched), around 130,000 tokens in total, costing well under $0.01 with text-embedding-3-small. Acceptable for most production systems.


7. Agentic Chunking

Agentic chunking delegates the splitting decision to an LLM. Rather than computing embedding similarity or counting tokens, the LLM reads the document and decides where to draw chunk boundaries — and optionally generates a proposition-style summary for each chunk.

Agentic chunking is expensive: it requires at least one LLM call per document section during indexing. It is slow: even fast models add latency at indexing time. It is non-deterministic: the same document may produce different chunks on different runs.

Despite these costs, agentic chunking produces the highest-quality chunks for specific use cases:

  • High-value, static corpora — legal contracts, technical specifications, pharmaceutical documentation — where chunking quality directly affects business outcomes and documents change infrequently
  • Complex, multi-topic documents — annual reports, research papers, policy documents — where topic boundaries are subtle and rule-based splitting fails
  • When you have retrieval quality benchmarks showing that simpler strategies are consistently below acceptable thresholds

The most effective agentic approach converts each document passage into a set of propositions — self-contained, atomic factual statements. Each proposition becomes its own chunk. This was introduced in the Dense X Retrieval paper and produces chunks that are highly specific and self-contained.

# Requires: openai>=1.50.0
import json

from openai import OpenAI

client = OpenAI()

PROPOSITION_PROMPT = """
You are an expert at decomposing text into atomic, self-contained propositions.
Given the following document passage, extract all factual claims as a list of
propositions. Each proposition must:
1. Be a single, complete sentence
2. Be fully self-contained (no pronouns referring to outside context)
3. Contain exactly one fact or claim
4. Be verifiable on its own without reading surrounding text

Return ONLY a JSON object of the form {{"propositions": ["...", "..."]}}.
No explanations.

Passage:
{passage}
"""

def extract_propositions(passage: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Use a fast, cheap model for chunking
        messages=[
            {"role": "user", "content": PROPOSITION_PROMPT.format(passage=passage)}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("propositions", [])

# Example usage
passage = """
LangChain's RecursiveCharacterTextSplitter is the recommended default for most
RAG applications. It splits text using a hierarchy of separators, starting with
double newlines and falling back to single newlines, then periods, then spaces.
Chunk size defaults to 1000 characters with 200 characters of overlap. The
overlap prevents loss of context at chunk boundaries.
"""

propositions = extract_propositions(passage)
for prop in propositions:
    print(f"- {prop}")

# Example output (exact wording is model-dependent):
# - LangChain's RecursiveCharacterTextSplitter is recommended as the default for most RAG applications.
# - RecursiveCharacterTextSplitter splits text using a hierarchy of separators.
# - The separator hierarchy starts with double newlines and falls back to single newlines, then periods, then spaces.
# - The default chunk size for RecursiveCharacterTextSplitter is 1000 characters.
# - The default overlap is 200 characters.
# - The overlap prevents loss of context at chunk boundaries.

An alternative agentic approach keeps passages intact but uses the LLM to determine where one topic ends and another begins:

# Requires: openai>=1.50.0
import json

from openai import OpenAI

client = OpenAI()

BOUNDARY_PROMPT = """
You are reading a document section by section. Your task is to identify whether
the current section should be grouped with the previous section (same topic) or
starts a new chunk (topic shift).

Previous section:
{previous}

Current section:
{current}

Should these be in the same chunk?
Respond with JSON: {{"same_chunk": true/false, "reason": "brief explanation"}}
"""

def agentic_chunk(sections: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Group document sections into chunks using LLM boundary detection."""
    if not sections:
        return []
    chunks = []
    current_chunk = [sections[0]]
    for i in range(1, len(sections)):
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": BOUNDARY_PROMPT.format(
                    previous=current_chunk[-1],
                    current=sections[i]
                )
            }],
            response_format={"type": "json_object"},
            temperature=0,
        )
        result = json.loads(response.choices[0].message.content)
        if result.get("same_chunk", True):
            current_chunk.append(sections[i])
        else:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [sections[i]]
    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks

Cost estimate for agentic chunking: A 10,000-word document split into ~50 sections requires 50 LLM calls. At gpt-4o-mini pricing ($0.15/1M input tokens), a 10,000-word document costs roughly $0.01-0.03 to chunk agentically. For a 10,000-document corpus, that is $100-300 — substantial, but potentially justified for high-value document types.


8. Choosing a Chunking Strategy

No single strategy wins across all document types and query patterns. The following decision matrix maps document characteristics to the appropriate chunking strategy.

| Document Type | Query Pattern | Recommended Strategy | Chunk Size |
| --- | --- | --- | --- |
| Technical docs, API references | Specific fact lookups | Recursive (code-aware) | 256-512 tokens |
| Markdown blog posts | Concept explanations | Recursive + sentence overlap | 512-1024 tokens |
| PDF research papers | Multi-step reasoning | Semantic + hierarchical | 512 child / 2048 parent |
| Legal contracts | Clause-specific questions | Agentic / proposition-based | Varies |
| Support ticket history | Problem-solution pairs | Sentence splitter | 128-256 tokens |
| Code repositories | Function / class lookup | Language-aware recursive | 1024-2048 tokens |
| Chat transcripts | Topic-based queries | Semantic chunking | 256-512 tokens |
| Product manuals | Procedure lookups | Recursive + metadata | 512-1024 tokens |

Do not choose a chunking strategy based on intuition or blog post recommendations. Choose it based on measurement. The evaluation process for chunking strategy selection:

Step 1: Build a golden dataset. Create 20-50 question-answer pairs representative of your actual user queries. Each question must have a specific, verifiable answer that exists in your documents.

Step 2: Index with each strategy. Run your full indexing pipeline with each candidate chunking strategy. Keep all other variables constant (same embedding model, same vector store, same retrieval top-K).

Step 3: Measure retrieval metrics. For each question, retrieve the top-5 chunks. Measure:

  • Recall@5 — does the correct answer appear in the top 5 chunks?
  • Context relevance — are the retrieved chunks actually about the question topic?
  • Context precision — what fraction of retrieved chunks are relevant?
# Requires: langchain-core>=0.3.0
def evaluate_chunking_strategy(
    qa_pairs: list[dict],
    retriever,
    top_k: int = 5,
) -> dict:
    """
    qa_pairs: [{"question": "...", "ground_truth": "..."}]
    retriever: a configured LangChain retriever
    """
    recall_hits = 0
    relevance_scores = []
    for pair in qa_pairs:
        question = pair["question"]
        ground_truth = pair["ground_truth"].lower()
        retrieved = retriever.invoke(question)[:top_k]
        # Recall: does retrieved context contain the answer?
        combined = " ".join(doc.page_content for doc in retrieved).lower()
        hit = ground_truth[:50] in combined  # match first 50 chars of answer
        recall_hits += int(hit)
        # Simple relevance proxy: fraction of chunks mentioning key terms
        key_terms = question.lower().split()[:5]
        relevant = sum(
            1 for doc in retrieved
            if any(term in doc.page_content.lower() for term in key_terms)
        )
        relevance_scores.append(relevant / top_k)
    return {
        "recall_at_k": recall_hits / len(qa_pairs),
        "avg_context_relevance": sum(relevance_scores) / len(relevance_scores),
        "total_questions": len(qa_pairs),
    }

Step 4: Make the decision. Pick the strategy with the best recall@5 and context relevance scores. If multiple strategies tie, prefer the simpler one. Complexity is a maintenance cost.
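Step 4 can be encoded directly. A small helper (purely illustrative; the metric name follows the evaluation function above) that applies the rule of best recall with ties going to the simpler strategy, assuming candidates are listed simplest-first:

```python
def pick_strategy(scores: dict[str, dict]) -> str:
    """Pick the strategy with the highest recall@k.

    scores maps strategy name -> metrics dict. List candidates simplest-first:
    ties keep the earlier entry, because only a strictly better recall
    replaces the current best.
    """
    best = None
    for name, metrics in scores.items():
        if best is None or metrics["recall_at_k"] > scores[best]["recall_at_k"]:
            best = name
    return best

# Example: semantic ties with recursive, so the simpler recursive wins
scores = {
    "fixed": {"recall_at_k": 0.68},
    "recursive": {"recall_at_k": 0.81},
    "semantic": {"recall_at_k": 0.81},
}
print(pick_strategy(scores))  # → recursive
```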

Your context window strategy interacts directly with chunking. The number of chunks you retrieve multiplied by your average chunk size must fit within your LLM’s context window alongside your system prompt and the user query. A common production configuration:

  • Context window: 128K tokens (GPT-4o, Claude 3.5 Sonnet)
  • System prompt: ~1,000 tokens
  • User query: ~200 tokens
  • Available for retrieved context: ~120,000 tokens (after reserving headroom for the model's output)
  • Chunk size: 1,024 tokens
  • Maximum retrievable chunks: ~117 chunks

Most production systems retrieve top-5 to top-10 chunks, leaving significant headroom. If you need more than 20 chunks to reliably answer questions, your chunks are probably too small and you should consider hierarchical chunking instead.
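The budget arithmetic above is worth scripting as a sanity check; the 4,000-token output reserve below is an assumption of this sketch (the text's ~120K figure implies similar headroom), not a fixed rule:

```python
def max_retrievable_chunks(context_window: int, system_prompt: int, query: int,
                           chunk_size: int, output_reserve: int = 4_000) -> int:
    """How many chunks of chunk_size tokens fit in the remaining token budget."""
    available = context_window - system_prompt - query - output_reserve
    return max(available, 0) // chunk_size

# 128K window, ~1,000-token system prompt, ~200-token query, 1,024-token chunks
print(max_retrievable_chunks(128_000, 1_000, 200, 1_024))  # → 119
```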


9. Chunking Interview Questions

Chunking is a consistent topic in GenAI engineer interviews. Interviewers use it to assess whether you understand RAG at the implementation level, not just the concept level.

Question 1: “How would you approach chunking for a 10,000-document legal contract corpus?”

Strong answer: Start by understanding the query patterns. Legal Q&A typically requires precise clause lookup (“what does Section 4.2 say about indemnification?”) rather than broad topic retrieval. This favors smaller, precise chunks.

For legal documents specifically, I would use a proposition-based agentic approach: have an LLM convert each clause into self-contained factual propositions. Attach metadata including the contract ID, section number, clause type, and effective date. This supports both semantic search (finding relevant clauses) and metadata filtering (restricting to a specific contract or date range). The indexing cost is justified because legal contracts change infrequently and the quality impact is high.

Question 2: “A user reports that your RAG system returns results about the right topic but gives wrong factual details. What is the most likely chunking problem?”


Strong answer: This symptom — right topic, wrong details — usually indicates chunks that are too large. When a chunk contains multiple facts, the embedding captures a blend of all of them. The retrieval correctly identifies the chunk as topically relevant, but the LLM cannot reliably identify which specific fact within the chunk answers the question — especially if the chunk contains contradicting or evolving information (e.g., “Previously X, but since 2025 Y”).

The fix is to reduce chunk size or switch to proposition-based chunking. I would also measure context precision: if relevant chunks consistently contain a mix of relevant and irrelevant sentences, the chunks need to be smaller or more precisely bounded.

Question 3: “What is the difference between chunk size and chunk overlap, and how do you tune each?”


Strong answer: Chunk size determines how much information each chunk holds — the granularity of your indexing. Chunk overlap is a safety margin that ensures information near chunk boundaries appears in both adjacent chunks, preventing queries from missing answers that happen to sit at a boundary.

Tune chunk size based on your query type and document structure. Specific fact questions need smaller chunks (256-512 tokens). Summaries and broad questions need larger chunks (1024-2048 tokens). Match chunk size to the granularity of your expected answers.

Tune overlap based on how costly missed boundaries are. A 10-20% overlap (e.g., 100 tokens of overlap on a 512-token chunk) recovers most boundary-loss problems. Going above 25% overlap adds storage cost and retrieval noise without proportional quality gains, because you are now returning nearly duplicate content in adjacent chunks.
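The mechanics behind that trade-off are just a sliding window whose stride is size minus overlap; a minimal sketch:

```python
def window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Fixed-size windows with overlap: each window starts size - overlap tokens
    after the previous one, so tokens near a boundary land in two chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
for chunk in window_chunks(tokens, size=4, overlap=1):
    print(chunk)
# Boundary tokens t3 and t6 each appear in two adjacent windows
```

Raising overlap toward the chunk size shrinks the stride, which is why heavy overlap multiplies storage and returns near-duplicate neighbors at query time.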

Question 4: “How do you know if your chunking strategy is actually working?”


Strong answer: By running retrieval evaluation before and after changing the chunking strategy. I use a golden dataset of 30-50 representative question-answer pairs. I measure recall@5 (does the correct answer appear in the top 5 retrieved chunks?), context relevance (are retrieved chunks about the right topic?), and context precision (fraction of retrieved chunks that are actually useful).

I also watch for second-order signals in production: if users frequently rephrase the same question, it suggests the first retrieval missed relevant context — often a chunking boundary problem. If answers are consistently “hallucinated but plausible,” it suggests the LLM is not finding grounding facts — often a chunk-too-large problem diluting the signal.


10. Where to Go Next

Chunking is the foundation. Once your chunking strategy is tuned and validated against retrieval benchmarks, the next layer of RAG quality improvements comes from:

  • Embedding model selection — the quality ceiling on semantic similarity. See the embeddings guide for model comparisons.
  • Hybrid search — combining dense vector search with sparse BM25 to capture both semantic and keyword relevance. Covered in the RAG architecture guide.
  • Vector database configuration — index type, HNSW parameters, and filtering strategies. See the vector database comparison.
  • RAG evaluation — systematic measurement of retrieval and generation quality using RAGAS metrics. The evaluation guide covers this in depth.

Frequently Asked Questions

What is chunking in RAG?

Chunking is the process of splitting documents into smaller pieces (chunks) before embedding and storing them in a vector database. It is a critical step in RAG because chunk size and strategy directly affect retrieval quality. Chunks that are too large dilute the relevance signal; chunks that are too small lose context.

What is semantic chunking?

Semantic chunking splits documents based on meaning rather than fixed character counts. It uses embedding similarity to detect topic boundaries — when the similarity between consecutive sentences drops below a threshold, it creates a new chunk. Semantic chunking typically improves retrieval accuracy by 10-20% over fixed-size chunking.

What chunk size should I use for RAG?

Start with 512-1024 tokens with 10-20% overlap as a baseline. For technical documentation, use 256-512 tokens to keep concepts focused. For narrative content, use 1024-2048 tokens to preserve context. The optimal size depends on your embedding model, query type, and LLM context window. Always benchmark with your actual data.

What is agentic chunking?

Agentic chunking uses an LLM to decide how to split documents. The LLM reads each section and determines optimal chunk boundaries based on semantic coherence, topic completeness, and self-containedness. It produces the highest-quality chunks but is expensive (requires an LLM call per document) and slow. Best used for high-value, static document collections.

What is recursive character splitting?

Recursive character splitting tries a hierarchy of separators in order: paragraph breaks, line breaks, sentence endings, spaces, and raw characters. It naturally respects the document's own structure by splitting between paragraphs before splitting between sentences. It is the default strategy in LangChain and the most widely used approach in production RAG systems.

What is hierarchical chunking in RAG?

Hierarchical chunking indexes small child chunks for precise retrieval but returns the larger parent chunk to the LLM for richer context. This pairs naturally with semantic chunking and outperforms flat chunking on complex multi-hop questions by 15-25%. It is the most effective approach for complex question answering in advanced RAG pipelines.

How does chunk overlap work and why is it important?

Chunk overlap is a safety margin that ensures information near chunk boundaries appears in both adjacent chunks. A 10-20% overlap recovers 60-70% of the boundary-loss problem at minimal storage cost. Going above 25% overlap adds storage cost and retrieval noise without proportional quality gains.

How do I benchmark different chunking strategies?

Create a golden dataset of 20-50 question-answer pairs representative of actual user queries. Index with each candidate chunking strategy while keeping all other variables constant. Measure recall@5, context relevance, and context precision for each strategy. See the RAG evaluation guide for detailed metrics.

When should I upgrade from fixed-size to semantic chunking?

Upgrade when retrieval evaluation shows chunks that cut off mid-thought or when context relevance scores fall below 0.7. Semantic chunking is especially effective for narrative documents like blog posts, research papers, and support articles where topics shift frequently within a single page.

What is proposition-based chunking?

Proposition-based chunking converts each document passage into a set of self-contained, atomic factual statements using an LLM. Each proposition becomes its own chunk containing exactly one verifiable fact. This was introduced in the Dense X Retrieval paper and produces highly specific, self-contained chunks ideal for high-value static corpora.