How to Chunk Documents for Better Retrieval
Why Chunking Matters
Embedding models produce a single fixed-length vector for each input text. When that input is a 5,000-word document covering authentication, database setup, deployment, and monitoring, the resulting vector is an average of all those topics. A query about "database connection limits" produces a vector that partially matches the document vector, but the match is weak because the document vector also represents three other unrelated topics. If that same database section were embedded as a separate chunk, the query vector would match it much more strongly because the chunk vector focuses entirely on the relevant topic.
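The dilution effect can be seen with a toy calculation. This is a sketch with synthetic orthogonal vectors standing in for real embeddings (real embedding spaces are not this clean, but the averaging effect is the same):

```python
import numpy as np

# Four orthogonal unit vectors stand in for embeddings of four
# unrelated topics covered by one long document.
topics = np.eye(4)   # auth, database, deployment, monitoring
query = topics[1]    # a query squarely about the database topic

# Embedding the whole document roughly averages its topics.
doc_vector = topics.mean(axis=0)
doc_vector /= np.linalg.norm(doc_vector)

# Embedding only the database section keeps the vector focused.
chunk_vector = topics[1]

doc_sim = float(query @ doc_vector)      # diluted match: 0.5
chunk_sim = float(query @ chunk_vector)  # focused match: 1.0
```

The whole-document vector scores 0.5 against the query while the focused chunk scores 1.0, which is the gap chunking exists to close.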
Chunking quality has a larger impact on retrieval accuracy than most other pipeline decisions. A study by LlamaIndex found that switching from 1,024-token chunks to 256-token chunks improved recall@5 by 12% on their benchmark dataset. Conversely, chunks that are too small (under 100 tokens) performed worse because they lacked sufficient context for the embedding model to produce meaningful vectors and lacked sufficient content to be useful when included in an LLM prompt.
Step-by-Step Implementation
Before choosing a chunking strategy, examine representative documents from your corpus. Are they structured with clear headings and sections (documentation, articles)? Are they unstructured running text (transcripts, emails, chat logs)? Do they have natural short units (FAQ entries, support tickets)? The answer determines which chunking approach will work best. Structured content benefits from semantic chunking at section boundaries. Unstructured content typically needs fixed-size or recursive chunking.
There are four main approaches. Fixed-size chunking splits text at regular token intervals. Semantic chunking splits at natural boundaries like paragraphs and sections. Recursive chunking tries progressively smaller boundaries until chunks meet a target size. Parent-child chunking indexes small chunks for precision but returns larger parent contexts for completeness.
# Strategy 1: Fixed-size chunking
# Best for: uniform, unstructured text (transcripts, logs)
import tiktoken
def fixed_size_chunks(text: str, chunk_size: int = 400,
                      overlap: int = 50,
                      encoding_name: str = "cl100k_base"):
    # "cl100k_base" is a tiktoken encoding name (not a model name),
    # so the parameter is named accordingly.
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        start += chunk_size - overlap  # step back to create the overlap
    return chunks

# Strategy 2: Semantic chunking (paragraph/section boundaries)
# Best for: structured documents with headings
import re
def semantic_chunks(text: str, max_tokens: int = 600,
                    min_tokens: int = 100):
    # Split on blank lines (paragraph boundaries)
    paragraphs = re.split(r'\n\s*\n', text)
    chunks = []
    current_chunk = []
    current_size = 0
    for para in paragraphs:
        para_tokens = len(para.split()) * 1.3  # rough token estimate
        if current_size + para_tokens > max_tokens and current_size >= min_tokens:
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = [para]
            current_size = para_tokens
        else:
            current_chunk.append(para)
            current_size += para_tokens
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    return chunks

# Strategy 3: Recursive chunking
# Best for: mixed content with varying section sizes
def recursive_chunks(text: str, max_tokens: int = 500,
                     separators=None):
    if separators is None:
        separators = ['\n## ', '\n### ', '\n\n', '\n', '. ', ' ']
    token_count = len(text.split()) * 1.3  # rough token estimate
    if token_count <= max_tokens:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            current = parts[0]
            for part in parts[1:]:
                candidate = current + sep + part
                if len(candidate.split()) * 1.3 > max_tokens:
                    if current.strip():
                        chunks.extend(recursive_chunks(current, max_tokens, separators))
                    current = part
                else:
                    current = candidate
            if current.strip():
                chunks.extend(recursive_chunks(current, max_tokens, separators))
            return chunks
    # Last resort: hard split at the midpoint
    words = text.split()
    mid = len(words) // 2
    return (recursive_chunks(' '.join(words[:mid]), max_tokens, separators) +
            recursive_chunks(' '.join(words[mid:]), max_tokens, separators))

# Strategy 4: Parent-child chunking
# Best for: when you need precise matching with rich context
def parent_child_chunks(text: str, parent_size: int = 1000,
                        child_size: int = 200, overlap: int = 30):
    parents = fixed_size_chunks(text, chunk_size=parent_size, overlap=0)
    all_children = []
    for parent_idx, parent in enumerate(parents):
        children = fixed_size_chunks(parent, chunk_size=child_size,
                                     overlap=overlap)
        for child in children:
            all_children.append({
                "text": child,
                "parent_idx": parent_idx,
                "parent_text": parent
            })
    return all_children

# At query time: search against child embeddings,
# but return parent_text as context to the LLM

The right chunk size depends on your query patterns. For specific, targeted queries ("what is the connection pool timeout"), smaller chunks (200 to 400 tokens) produce more precise matches. For broad, explanatory queries ("explain the authentication architecture"), larger chunks (600 to 1,000 tokens) provide more complete answers. Start with 400 tokens as a default and adjust based on retrieval evaluation. Most production systems settle between 300 and 600 tokens after tuning.
Overlap ensures that information near chunk boundaries is captured in at least one chunk. Without overlap, a sentence spanning two chunks may not fully appear in either one, and neither chunk's embedding captures its meaning. An overlap of 10 to 15% of the chunk size (40 to 60 tokens for a 400-token chunk) handles most boundary cases without significantly increasing storage. Larger overlaps waste storage without improving retrieval.
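The boundary effect is easy to demonstrate with a minimal word-level chunker (a toy sketch; real pipelines step through tokens rather than words, as in the fixed-size strategy above):

```python
def word_chunks(text, chunk_size=8, overlap=0):
    # Toy word-level sliding window to show the boundary effect.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

text = ("The pool allows fifty connections at most and "
        "requests beyond that limit are queued until a slot frees up")

no_overlap = word_chunks(text, chunk_size=8, overlap=0)
with_overlap = word_chunks(text, chunk_size=8, overlap=3)
```

Without overlap, the phrase "and requests beyond" straddles the first chunk boundary and appears whole in neither chunk; with an overlap of 3 words, one chunk captures it intact.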
Attach metadata to each chunk that enables filtering and provides context when the chunk is retrieved. At minimum, include the source document identifier, the section heading (if available), and the chunk position within the document. This metadata supports pre-filtering (search only within a specific document or section) and provides context to the LLM (the chunk came from section "Database Configuration" of the infrastructure guide).
def enrich_chunk(chunk_text: str, source_doc: str,
                 section: str, position: int) -> dict:
    return {
        "text": chunk_text,
        "metadata": {
            "source": source_doc,
            "section": section,
            "position": position,
            "token_count": int(len(chunk_text.split()) * 1.3)  # rough estimate
        }
    }

Test different chunk sizes on a set of queries with known relevant answers. Measure recall@k: what fraction of known relevant chunks appear in the top k results. If recall is low on specific queries, chunks may be too large (topic dilution). If recall is low on broad queries, chunks may be too small (insufficient context). Iterate until recall stabilizes, then lock in the chunk size for your production pipeline.
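A minimal recall@k harness might look like the following. The `eval_set` and the canned `search` function are hypothetical stand-ins; in practice `search` would call your retriever and the evaluation set would hold real queries with hand-labeled relevant chunk IDs:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of known-relevant chunks that appear in the top-k results.
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical evaluation set: query -> ids of chunks known to answer it.
eval_set = {
    "what is the connection pool timeout": {"doc3-chunk7"},
    "explain the authentication architecture": {"doc1-chunk2", "doc1-chunk3"},
}

def search(query):
    # Stand-in retriever returning canned results for illustration.
    return ["doc3-chunk7", "doc9-chunk1", "doc1-chunk2",
            "doc2-chunk4", "doc1-chunk5"]

scores = [recall_at_k(search(q), rel, k=5) for q, rel in eval_set.items()]
mean_recall = sum(scores) / len(scores)
```

Run the same harness once per candidate chunk size and compare the mean recall figures; the per-query scores also show which query style (specific vs. broad) is suffering.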
Common Mistakes
Embedding entire documents without chunking is the most common mistake. Even with large embedding models that accept long inputs, the resulting vector is a semantic average that matches no specific query well. Always chunk, even if your documents are relatively short (under 1,000 tokens).
Splitting mid-sentence creates chunks where the beginning or end is semantically incomplete. Always split at sentence boundaries at minimum, and prefer paragraph or section boundaries when available.
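One way to enforce sentence boundaries is to split on sentence-ending punctuation and pack whole sentences into chunks. This is a naive sketch using a character budget; a regex splitter mishandles abbreviations and decimals, so libraries such as nltk or spaCy are better for production sentence segmentation:

```python
import re

def sentence_safe_chunks(text, max_chars=200):
    # Split on ., !, or ? followed by whitespace, then pack whole
    # sentences into chunks so no chunk starts or ends mid-sentence.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = sentence_safe_chunks(
    "First sentence here. Second sentence is a bit longer. "
    "Third one ends it!", max_chars=40)
```

Every chunk produced this way ends at sentence-final punctuation, so no embedding is computed over a fragment.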
Ignoring chunk overlap at boundaries causes information loss. Sentences and concepts that span chunk boundaries are partially captured in each chunk, reducing the embedding quality of both. Even 30 to 50 tokens of overlap significantly reduces this problem.
Adaptive Recall handles chunking, embedding, and retrieval as a managed pipeline. Store memories in natural language and the system handles the rest.