RAG Chunking Strategies — Semantic, Recursive & Agentic Chunking (2026)

Chunking is the single most underrated decision in any RAG system. Engineers spend days tuning embedding models and vector databases while leaving their chunking strategy at default — and wonder why retrieval quality is mediocre. This guide covers every chunking strategy used in production in 2026, with Python code and clear guidance on when to use each.

This guide is written for engineers who have a working RAG pipeline and want to understand chunking deeply — not just how to call a splitter, but why the strategy choice matters and how to benchmark it.

You will get the most from this guide if you:

  • Have built or are actively building a RAG pipeline
  • Understand what embeddings are and how vector similarity search works
  • Have chosen or are evaluating a vector database for your system
  • Want to move beyond “just use 512 tokens with 10% overlap” defaults
  • Are preparing for GenAI engineer interviews where chunking is a standard topic

If you are new to RAG, start with the RAG architecture guide first, then return here.


2. Why Chunking Matters More Than You Think


Chunking is not a preprocessing detail. It is an architectural decision that determines the upper bound of your retrieval quality. No amount of embedding model tuning or vector search optimization can recover information that was shredded or buried by a poor chunking strategy.

Every chunking decision navigates a fundamental trade-off between two failure modes:

Chunks too large: The embedding captures an average of many topics. When you search for a specific detail, the similarity score gets diluted by unrelated content in the same chunk. The top-K retrieval returns chunks that are only partially relevant. Your LLM gets flooded with irrelevant context and either ignores it or hallucinates around it.

Chunks too small: Each chunk lacks the surrounding context needed to make sense of the information it contains. A sentence like “It supports up to 1 million rows” is meaningless without knowing what “it” refers to. Your LLM gets precise matches but cannot generate a coherent answer because each chunk is an orphaned fragment.

The ideal chunk is the smallest unit of text that is fully self-contained and answers a single, coherent question. That ideal differs by document type, query pattern, and embedding model — which is why no single default setting works for every system.

Benchmark comparisons of chunking strategies commonly report results along these lines (exact numbers vary by corpus and embedding model):

  • Moving from fixed-size to recursive splitting improves retrieval recall by 5-12% on structured documents (code docs, API references, technical manuals)
  • Moving from fixed-size to semantic chunking improves retrieval accuracy by 10-20% on narrative documents (blog posts, research papers, support articles)
  • Adding 10-20% token overlap to any fixed-size strategy recovers 60-70% of the boundary-loss problem at minimal storage cost
  • Hierarchical chunking (small chunks for retrieval, parent chunks for context) outperforms flat chunking on complex multi-hop questions by 15-25%

These improvements compound. A system using semantic chunking + hierarchical retrieval can outperform a flat fixed-size system by 30-40% on retrieval-based evaluation metrics — with the same embedding model and vector database.


3. Fixed-Size Chunking

Fixed-size chunking is the simplest strategy: split the document every N characters or tokens, optionally with an overlap window. It requires no NLP dependencies, runs in milliseconds, and is deterministic. Every engineer starts here.

# Requires: langchain-text-splitters>=0.3.0
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",     # Try to split on double newlines first
    chunk_size=1000,      # Characters per chunk
    chunk_overlap=200,    # Characters of overlap between chunks
    length_function=len,  # Measure by character count
)

with open("document.txt") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
print(f"First chunk preview: {chunks[0][:200]}")

Character-based splitting is fast but brittle. It will split mid-sentence whenever the character budget runs out, regardless of whether the separator appears. It works acceptably for clean, uniformly structured plain text. It fails badly on mixed-format documents.

Embedding models operate on tokens, not characters. A character count does not accurately predict the number of tokens a chunk will consume. For production systems, measure chunk size in tokens to avoid accidentally exceeding your embedding model’s maximum input length (typically 512 or 8,192 tokens depending on the model).

# Requires: tiktoken>=0.7.0, langchain-text-splitters>=0.3.0
import tiktoken
from langchain_text_splitters import TokenTextSplitter

# Use the same tokenizer as your embedding model
encoding = tiktoken.get_encoding("cl100k_base")  # OpenAI models

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=512,    # Tokens per chunk
    chunk_overlap=64,  # Token overlap (~12%)
)

chunks = splitter.split_text(text)

# Verify token counts
for i, chunk in enumerate(chunks[:3]):
    token_count = len(encoding.encode(chunk))
    print(f"Chunk {i}: {token_count} tokens, {len(chunk)} chars")

The simplest upgrade from pure fixed-size chunking: split on sentence boundaries first, then group sentences until you approach your token budget. This avoids mid-sentence breaks without requiring embedding-based boundary detection.

# Requires: spacy>=3.7.0
import spacy

# spaCy for sentence detection
nlp = spacy.load("en_core_web_sm")

def sentence_aware_chunks(text: str, max_tokens: int = 512, overlap_sentences: int = 2) -> list[str]:
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    chunks = []
    current_chunk = []
    current_tokens = 0
    for sentence in sentences:
        sentence_tokens = len(sentence.split())  # rough token estimate; use tiktoken for accuracy
        if current_tokens + sentence_tokens > max_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            # Keep last N sentences as overlap
            current_chunk = current_chunk[-overlap_sentences:]
            current_tokens = sum(len(s.split()) for s in current_chunk)
        current_chunk.append(sentence)
        current_tokens += sentence_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

When to use fixed-size chunking: Prototypes, uniform plain-text documents, when you need deterministic chunk sizes, or when indexing speed matters more than retrieval quality.

When to upgrade: When you see retrieval returning chunks that cut off mid-thought, or when your evaluation metrics show context relevance below 0.7.


4. Recursive Character Splitting

Recursive character splitting is the default strategy in LangChain and the most widely used approach in production RAG systems. It improves on plain character splitting by trying a hierarchy of separators in order, falling back to the next when the preferred separator is not present.

The splitter maintains a priority list of separators: ["\n\n", "\n", ". ", " ", ""]. It first tries to split on double newlines (paragraph breaks). If a resulting piece is still too large, it tries single newlines. Then sentence-ending periods. Then spaces. The raw character split is a last resort.

This means the splitter naturally respects the document’s own structure — it splits between paragraphs before it splits between sentences, and between sentences before it splits mid-word.

# Requires: langchain-text-splitters>=0.3.0
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=[
        "\n\n",  # Paragraph breaks (first preference)
        "\n",    # Line breaks
        ". ",    # Sentence endings
        "! ",
        "? ",
        " ",     # Word boundaries
        "",      # Characters (last resort)
    ],
)

# documents: a list[Document] produced by your loader
chunks = splitter.split_documents(documents)

For structured content formats, customize the separator hierarchy to match the format:

# Requires: langchain-text-splitters>=0.3.0
# For Markdown documents
from langchain_text_splitters import MarkdownTextSplitter

md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)

# For Python code
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,  # Code chunks can be larger — denser information
    chunk_overlap=200,
)

# For HTML
html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML,
    chunk_size=1000,
    chunk_overlap=100,
)

The language-aware variants use syntax-aware separators. For Python, it splits on class and function definitions before falling back to newlines and then characters. This keeps functions and class methods intact as single chunks — critical for code Q&A systems where splitting a function mid-body makes the chunk useless.

Raw text chunks without provenance are difficult to trace back to their source. Always attach metadata at chunking time:

# Requires: langchain-core>=0.3.0, langchain-text-splitters>=0.3.0
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)

def chunk_with_metadata(file_path: str, source_url: str) -> list[Document]:
    with open(file_path) as f:
        raw_text = f.read()
    base_doc = Document(
        page_content=raw_text,
        metadata={
            "source": source_url,
            "file_path": file_path,
        },
    )
    chunks = splitter.split_documents([base_doc])
    # Enrich each chunk with its position in the source document
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_index"] = i
        chunk.metadata["total_chunks"] = len(chunks)
    return chunks

When to use recursive splitting: The default choice for most production systems. It handles mixed-format documents gracefully, respects natural document structure, and requires no additional dependencies. Use it until you have evidence from evaluation metrics that a more sophisticated strategy is worth the added complexity.


5. The Strategy Progression

The progression below shows how chunking approaches build on each other in complexity and quality. Most production teams start at Fixed-Size, move to Recursive when they need better structure awareness, then add Semantic splitting when evaluation data shows retrieval gaps.

RAG Chunking Strategy Progression

Each stage adds quality at the cost of complexity and compute. Upgrade based on retrieval evaluation metrics, not instinct.

  • Fixed-Size: split every N tokens against a character or token budget. Simple, fast, deterministic, with no NLP dependencies; an optional overlap window recovers boundary loss. Ships in minutes. Use for: prototypes and uniform plain-text corpora.
  • Recursive: respect document structure via a separator hierarchy (\n\n → \n → . → space), with format-aware variants for Markdown, Python, and HTML. Preserves paragraphs; a low-cost upgrade. Use for: most production systems and mixed-format documents.
  • Semantic: split on topic boundaries detected by embedding similarity. Variable chunk sizes adapt to content density, with 10-20% better recall over a fixed-size baseline. Use for: narrative content such as research papers and support docs.
  • Agentic: LLM-decided, content-aware boundaries; the LLM reads each section. Best boundary quality and highest retrieval precision, but expensive indexing (an LLM call per document). Use for: high-value documents in static, curated corpora.

6. Semantic Chunking

Semantic chunking replaces rule-based splits with embedding-based boundary detection. Instead of splitting every N tokens, it splits when the meaning of the text changes — when the embedding similarity between adjacent sentences drops below a threshold.

The algorithm embeds each sentence independently. It then computes the cosine similarity between each consecutive sentence pair. When similarity drops sharply — a “semantic break point” — the algorithm creates a chunk boundary. Sentences on either side of the break belong to different topics and should not be in the same chunk.
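The break-point algorithm itself fits in a few lines. The sketch below substitutes a toy bag-of-words vector for a real embedding model so it runs with no API calls; `toy_embed` is our stand-in, not part of any library, and you would swap in real sentence embeddings for production.

```python
import math
import re

def toy_embed(sentence: str) -> dict[str, int]:
    # Stand-in for a real embedding model: a word-count vector
    vec: dict[str, int] = {}
    for word in re.findall(r"[a-z']+", sentence.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks: list[str] = []
    current = [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:  # semantic break point
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The cat sat on the mat.",
    "The cat licked the mat.",
    "Interest rates rose sharply in March.",
]
print(semantic_chunks(sentences))
# The two cat sentences stay together; the finance sentence starts a new chunk.
```

With a real model, you would batch the embedding calls and tune the threshold empirically, which is exactly what the percentile and standard-deviation settings in LangChain's SemanticChunker automate.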

# Requires: langchain-experimental>=0.3.0, langchain-openai>=0.2.0
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Percentile threshold: break wherever the sentence-to-sentence distance
# exceeds the 85th percentile of all distances
chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,  # Higher = fewer, larger chunks
)

with open("technical_doc.txt") as f:
    text = f.read()

chunks = chunker.split_text(text)
print(f"Created {len(chunks)} semantic chunks")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(chunk[:300])

LangChain’s SemanticChunker supports three threshold types:

| Threshold Type | Behavior | Best For |
| --- | --- | --- |
| percentile | Split at sentences where similarity is below the Nth percentile | Documents with consistent topic density |
| standard_deviation | Split when similarity drops more than N standard deviations from the mean | Documents with variable topic density |
| interquartile | Split based on the IQR of similarity scores | Robust to outliers, good general default |
# Requires: langchain-experimental>=0.3.0, langchain-openai>=0.2.0
# (embeddings is the OpenAIEmbeddings instance from the previous example)

# Standard deviation approach — good for technical docs with dense information
chunker_std = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.5,  # Split where the drop exceeds 1.5 std devs from the mean
)

# Interquartile approach — robust general default
chunker_iqr = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5,
)

Hierarchical chunking: combine small and large


Semantic chunking pairs naturally with a hierarchical retrieval pattern. Index small chunks (precise retrieval) but return the parent chunk (rich context) to the LLM. This is the most effective approach for complex question answering.

# Requires: langchain>=0.3.0, langchain-experimental>=0.3.0, langchain-openai>=0.2.0, langchain-community>=0.3.0
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Child splitter: small chunks for precise retrieval
child_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,
)

# Parent splitter: larger chunks for context delivery to LLM
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)

# Vector store indexes child chunks
vectorstore = Chroma(embedding_function=embeddings)
# Document store holds parent chunks
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents — both child and parent chunks are created automatically
# (documents: a list[Document] produced by your loader)
retriever.add_documents(documents)

# Retrieval returns parent chunks even though child chunks matched
results = retriever.invoke("What are the authentication requirements?")

When to use semantic chunking: Narrative documents (blog posts, research papers, support articles, PDFs), when your embedding-based evaluation shows low context relevance scores, or when document topics shift frequently within a single page.

Cost consideration: Semantic chunking requires one embedding call per sentence during the chunking phase — this adds cost at indexing time but not at query time. For a 100,000-word corpus (roughly 5,000-7,000 sentences), expect a few thousand embedding calls (far fewer if batched), around 130,000 tokens in total, costing well under $0.01 with text-embedding-3-small. Acceptable for most production systems.


7. Agentic Chunking

Agentic chunking delegates the splitting decision to an LLM. Rather than computing embedding similarity or counting tokens, the LLM reads the document and decides where to draw chunk boundaries — and optionally generates a proposition-style summary for each chunk.

Agentic chunking is expensive: it requires at least one LLM call per document section during indexing. It is slow: even fast models add latency at indexing time. It is non-deterministic: the same document may produce different chunks on different runs.

Despite these costs, agentic chunking produces the highest-quality chunks for specific use cases:

  • High-value, static corpora — legal contracts, technical specifications, pharmaceutical documentation — where chunking quality directly affects business outcomes and documents change infrequently
  • Complex, multi-topic documents — annual reports, research papers, policy documents — where topic boundaries are subtle and rule-based splitting fails
  • When you have retrieval quality benchmarks showing that simpler strategies are consistently below acceptable thresholds

The most effective agentic approach converts each document passage into a set of propositions — self-contained, atomic factual statements. Each proposition becomes its own chunk. This was introduced in the Dense X Retrieval paper and produces chunks that are highly specific and self-contained.

# Requires: openai>=1.50.0
import json

from openai import OpenAI

client = OpenAI()

PROPOSITION_PROMPT = """
You are an expert at decomposing text into atomic, self-contained propositions.
Given the following document passage, extract all factual claims as a list of
propositions. Each proposition must:
1. Be a single, complete sentence
2. Be fully self-contained (no pronouns referring to outside context)
3. Contain exactly one fact or claim
4. Be verifiable on its own without reading surrounding text

Return ONLY a JSON object of the form {{"propositions": ["...", "..."]}}.
No explanations.

Passage:
{passage}
"""

def extract_propositions(passage: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Use a fast, cheap model for chunking
        messages=[
            {"role": "user", "content": PROPOSITION_PROMPT.format(passage=passage)}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("propositions", [])

# Example usage
passage = """
LangChain's RecursiveCharacterTextSplitter is the recommended default for most
RAG applications. It splits text using a hierarchy of separators, starting with
double newlines and falling back to single newlines, then periods, then spaces.
Chunk size defaults to 1000 characters with 200 characters of overlap. The
overlap prevents loss of context at chunk boundaries.
"""

propositions = extract_propositions(passage)
for prop in propositions:
    print(f"- {prop}")

# Example output (exact wording is model-dependent):
# - LangChain's RecursiveCharacterTextSplitter is recommended as the default for most RAG applications.
# - RecursiveCharacterTextSplitter splits text using a hierarchy of separators.
# - The separator hierarchy starts with double newlines and falls back to single newlines, then periods, then spaces.
# - The default chunk size for RecursiveCharacterTextSplitter is 1000 characters.
# - The default overlap is 200 characters.
# - The overlap prevents loss of context at chunk boundaries.

An alternative agentic approach keeps passages intact but uses the LLM to determine where one topic ends and another begins:

# Requires: openai>=1.50.0
import json

from openai import OpenAI

client = OpenAI()

BOUNDARY_PROMPT = """
You are reading a document section by section. Your task is to identify whether
the current section should be grouped with the previous section (same topic) or
starts a new chunk (topic shift).

Previous section:
{previous}

Current section:
{current}

Should these be in the same chunk?
Respond with JSON: {{"same_chunk": true/false, "reason": "brief explanation"}}
"""

def agentic_chunk(sections: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Group document sections into chunks using LLM boundary detection."""
    if not sections:
        return []
    chunks = []
    current_chunk = [sections[0]]
    for i in range(1, len(sections)):
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": BOUNDARY_PROMPT.format(
                    previous=current_chunk[-1],
                    current=sections[i]
                )
            }],
            response_format={"type": "json_object"},
            temperature=0,
        )
        result = json.loads(response.choices[0].message.content)
        if result.get("same_chunk", True):
            current_chunk.append(sections[i])
        else:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [sections[i]]
    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks

Cost estimate for agentic chunking: A 10,000-word document split into ~50 sections requires 50 LLM calls. At gpt-4o-mini pricing ($0.15/1M input tokens), a 10,000-word document costs roughly $0.01-0.03 to chunk agentically. For a 10,000-document corpus, that is $100-300 — substantial, but potentially justified for high-value document types.


8. Choosing a Chunking Strategy

No single strategy wins across all document types and query patterns. The following decision matrix maps document characteristics to the appropriate chunking strategy.

| Document Type | Query Pattern | Recommended Strategy | Chunk Size |
| --- | --- | --- | --- |
| Technical docs, API references | Specific fact lookups | Recursive (code-aware) | 256-512 tokens |
| Markdown blog posts | Concept explanations | Recursive + sentence overlap | 512-1024 tokens |
| PDF research papers | Multi-step reasoning | Semantic + hierarchical | 512 child / 2048 parent |
| Legal contracts | Clause-specific questions | Agentic / proposition-based | Varies |
| Support ticket history | Problem-solution pairs | Sentence splitter | 128-256 tokens |
| Code repositories | Function / class lookup | Language-aware recursive | 1024-2048 tokens |
| Chat transcripts | Topic-based queries | Semantic chunking | 256-512 tokens |
| Product manuals | Procedure lookups | Recursive + metadata | 512-1024 tokens |

Do not choose a chunking strategy based on intuition or blog post recommendations. Choose it based on measurement. The evaluation process for chunking strategy selection:

Step 1: Build a golden dataset. Create 20-50 question-answer pairs representative of your actual user queries. Each question must have a specific, verifiable answer that exists in your documents.

Step 2: Index with each strategy. Run your full indexing pipeline with each candidate chunking strategy. Keep all other variables constant (same embedding model, same vector store, same retrieval top-K).

Step 3: Measure retrieval metrics. For each question, retrieve the top-5 chunks. Measure:

  • Recall@5 — does the correct answer appear in the top 5 chunks?
  • Context relevance — are the retrieved chunks actually about the question topic?
  • Context precision — what fraction of retrieved chunks are relevant?
# Requires: langchain-core>=0.3.0
def evaluate_chunking_strategy(
    qa_pairs: list[dict],
    retriever,
    top_k: int = 5,
) -> dict:
    """
    qa_pairs: [{"question": "...", "ground_truth": "..."}]
    retriever: a configured LangChain retriever
    """
    recall_hits = 0
    relevance_scores = []
    for pair in qa_pairs:
        question = pair["question"]
        ground_truth = pair["ground_truth"].lower()
        retrieved = retriever.invoke(question)[:top_k]
        # Recall: does retrieved context contain the answer?
        combined = " ".join(doc.page_content for doc in retrieved).lower()
        hit = ground_truth[:50] in combined  # match first 50 chars of answer
        recall_hits += int(hit)
        # Simple relevance proxy: fraction of chunks mentioning key terms
        key_terms = question.lower().split()[:5]
        relevant = sum(
            1 for doc in retrieved
            if any(term in doc.page_content.lower() for term in key_terms)
        )
        relevance_scores.append(relevant / top_k)
    return {
        "recall_at_k": recall_hits / len(qa_pairs),
        "avg_context_relevance": sum(relevance_scores) / len(relevance_scores),
        "total_questions": len(qa_pairs),
    }

Step 4: Make the decision. Pick the strategy with the best recall@5 and context relevance scores. If multiple strategies tie, prefer the simpler one. Complexity is a maintenance cost.
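Step 4 can be encoded directly. A small helper (purely illustrative; the metric name follows the evaluation function above) that applies the rule of best recall with ties going to the simpler strategy, assuming candidates are listed simplest-first:

```python
def pick_strategy(scores: dict[str, dict]) -> str:
    """Pick the strategy with the highest recall@k.

    scores maps strategy name -> metrics dict. List candidates simplest-first:
    ties keep the earlier entry, because only a strictly better recall
    replaces the current best.
    """
    best = None
    for name, metrics in scores.items():
        if best is None or metrics["recall_at_k"] > scores[best]["recall_at_k"]:
            best = name
    return best

# Example: semantic ties with recursive, so the simpler recursive wins
scores = {
    "fixed": {"recall_at_k": 0.68},
    "recursive": {"recall_at_k": 0.81},
    "semantic": {"recall_at_k": 0.81},
}
print(pick_strategy(scores))  # → recursive
```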

Your context window strategy interacts directly with chunking. The number of chunks you retrieve multiplied by your average chunk size must fit within your LLM’s context window alongside your system prompt and the user query. A common production configuration:

  • Context window: 128K tokens (GPT-4o, Claude 3.5 Sonnet)
  • System prompt: ~1,000 tokens
  • User query: ~200 tokens
  • Available for retrieved context: ~120,000 tokens (after reserving headroom for the model's output)
  • Chunk size: 1,024 tokens
  • Maximum retrievable chunks: ~117 chunks

Most production systems retrieve top-5 to top-10 chunks, leaving significant headroom. If you need more than 20 chunks to reliably answer questions, your chunks are probably too small and you should consider hierarchical chunking instead.
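The budget arithmetic above is worth scripting as a sanity check; the 4,000-token output reserve below is an assumption of this sketch (the text's ~120K figure implies similar headroom), not a fixed rule:

```python
def max_retrievable_chunks(context_window: int, system_prompt: int, query: int,
                           chunk_size: int, output_reserve: int = 4_000) -> int:
    """How many chunks of chunk_size tokens fit in the remaining token budget."""
    available = context_window - system_prompt - query - output_reserve
    return max(available, 0) // chunk_size

# 128K window, ~1,000-token system prompt, ~200-token query, 1,024-token chunks
print(max_retrievable_chunks(128_000, 1_000, 200, 1_024))  # → 119
```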


9. Chunking Interview Questions

Chunking is a consistent topic in GenAI engineer interviews. Interviewers use it to assess whether you understand RAG at the implementation level, not just the concept level.

Question 1: “How would you approach chunking for a 10,000-document legal contract corpus?”

Strong answer: Start by understanding the query patterns. Legal Q&A typically requires precise clause lookup (“what does Section 4.2 say about indemnification?”) rather than broad topic retrieval. This favors smaller, precise chunks.

For legal documents specifically, I would use a proposition-based agentic approach: have an LLM convert each clause into self-contained factual propositions. Attach metadata including the contract ID, section number, clause type, and effective date. This supports both semantic search (finding relevant clauses) and metadata filtering (restricting to a specific contract or date range). The indexing cost is justified because legal contracts change infrequently and the quality impact is high.

Question 2: “A user reports that your RAG system returns results about the right topic but gives wrong factual details. What is the most likely chunking problem?”


Strong answer: This symptom — right topic, wrong details — usually indicates chunks that are too large. When a chunk contains multiple facts, the embedding captures a blend of all of them. The retrieval correctly identifies the chunk as topically relevant, but the LLM cannot reliably identify which specific fact within the chunk answers the question — especially if the chunk contains contradicting or evolving information (e.g., “Previously X, but since 2025 Y”).

The fix is to reduce chunk size or switch to proposition-based chunking. I would also measure context precision: if relevant chunks consistently contain a mix of relevant and irrelevant sentences, the chunks need to be smaller or more precisely bounded.

Question 3: “What is the difference between chunk size and chunk overlap, and how do you tune each?”


Strong answer: Chunk size determines how much information each chunk holds — the granularity of your indexing. Chunk overlap is a safety margin that ensures information near chunk boundaries appears in both adjacent chunks, preventing queries from missing answers that happen to sit at a boundary.

Tune chunk size based on your query type and document structure. Specific fact questions need smaller chunks (256-512 tokens). Summaries and broad questions need larger chunks (1024-2048 tokens). Match chunk size to the granularity of your expected answers.

Tune overlap based on how costly missed boundaries are. A 10-20% overlap (e.g., 100 tokens of overlap on a 512-token chunk) recovers most boundary-loss problems. Going above 25% overlap adds storage cost and retrieval noise without proportional quality gains, because you are now returning nearly duplicate content in adjacent chunks.
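The mechanics behind that trade-off are just a sliding window whose stride is size minus overlap; a minimal sketch:

```python
def window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Fixed-size windows with overlap: each window starts size - overlap tokens
    after the previous one, so tokens near a boundary land in two chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
for chunk in window_chunks(tokens, size=4, overlap=1):
    print(chunk)
# Boundary tokens t3 and t6 each appear in two adjacent windows
```

Raising overlap toward the chunk size shrinks the stride, which is why heavy overlap multiplies storage and returns near-duplicate neighbors at query time.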

Question 4: “How do you know if your chunking strategy is actually working?”


Strong answer: By running retrieval evaluation before and after changing the chunking strategy. I use a golden dataset of 30-50 representative question-answer pairs. I measure recall@5 (does the correct answer appear in the top 5 retrieved chunks?), context relevance (are retrieved chunks about the right topic?), and context precision (fraction of retrieved chunks that are actually useful).

I also watch for second-order signals in production: if users frequently rephrase the same question, it suggests the first retrieval missed relevant context — often a chunking boundary problem. If answers are consistently “hallucinated but plausible,” it suggests the LLM is not finding grounding facts — often a chunk-too-large problem diluting the signal.


10. Where to Go Next

Chunking is the foundation. Once your chunking strategy is tuned and validated against retrieval benchmarks, the next layer of RAG quality improvements comes from:

  • Embedding model selection — the quality ceiling on semantic similarity. See the embeddings guide for model comparisons.
  • Hybrid search — combining dense vector search with sparse BM25 to capture both semantic and keyword relevance. Covered in the RAG architecture guide.
  • Vector database configuration — index type, HNSW parameters, and filtering strategies. See the vector database comparison.
  • RAG evaluation — systematic measurement of retrieval and generation quality using RAGAS metrics. The evaluation guide covers this in depth.

Frequently Asked Questions

What is chunking in RAG?

Chunking is the process of splitting documents into smaller pieces (chunks) before embedding and storing them in a vector database. It is a critical step in RAG because chunk size and strategy directly affect retrieval quality. Chunks that are too large dilute the relevance signal; chunks that are too small lose context.

What is semantic chunking?

Semantic chunking splits documents based on meaning rather than fixed character counts. It uses embedding similarity to detect topic boundaries — when the similarity between consecutive sentences drops below a threshold, it creates a new chunk. Semantic chunking typically improves retrieval accuracy by 10-20% over fixed-size chunking.

What chunk size should I use for RAG?

Start with 512-1024 tokens with 10-20% overlap as a baseline. For technical documentation, use 256-512 tokens to keep concepts focused. For narrative content, use 1024-2048 tokens to preserve context. The optimal size depends on your embedding model, query type, and LLM context window. Always benchmark with your actual data.

What is agentic chunking?

Agentic chunking uses an LLM to decide how to split documents. The LLM reads each section and determines optimal chunk boundaries based on semantic coherence, topic completeness, and self-containedness. It produces the highest-quality chunks but is expensive (requires an LLM call per document) and slow. Best used for high-value, static document collections.

What is recursive character splitting?

Recursive character splitting tries a hierarchy of separators in order: paragraph breaks, line breaks, sentence endings, spaces, and raw characters. It naturally respects the document's own structure by splitting between paragraphs before splitting between sentences. It is the default strategy in LangChain and the most widely used approach in production RAG systems.

What is hierarchical chunking in RAG?

Hierarchical chunking indexes small child chunks for precise retrieval but returns the larger parent chunk to the LLM for richer context. This pairs naturally with semantic chunking and outperforms flat chunking on complex multi-hop questions by 15-25%. It is the most effective approach for complex question answering in advanced RAG pipelines.

How does chunk overlap work and why is it important?

Chunk overlap is a safety margin that ensures information near chunk boundaries appears in both adjacent chunks. A 10-20% overlap recovers 60-70% of the boundary-loss problem at minimal storage cost. Going above 25% overlap adds storage cost and retrieval noise without proportional quality gains.

How do I benchmark different chunking strategies?

Create a golden dataset of 20-50 question-answer pairs representative of actual user queries. Index with each candidate chunking strategy while keeping all other variables constant. Measure recall@5, context relevance, and context precision for each strategy. See the RAG evaluation guide for detailed metrics.

When should I upgrade from fixed-size to semantic chunking?

Upgrade when retrieval evaluation shows chunks that cut off mid-thought or when context relevance scores fall below 0.7. Semantic chunking is especially effective for narrative documents like blog posts, research papers, and support articles where topics shift frequently within a single page.

What is proposition-based chunking?

Proposition-based chunking converts each document passage into a set of self-contained, atomic factual statements using an LLM. Each proposition becomes its own chunk containing exactly one verifiable fact. This was introduced in the Dense X Retrieval paper and produces highly specific, self-contained chunks ideal for high-value static corpora.