The Optimal Chunk Size for RAG: Benchmarks
What the Benchmarks Show
Multiple studies have evaluated the impact of chunk size on RAG retrieval quality. The consistent finding across embedding models, vector databases, and content types is a sweet spot between 400 and 600 tokens where retrieval recall and answer quality are both strong. Below this range, chunks match specific queries more precisely but carry too little context for good answers; above it, chunks carry more context but match queries less precisely.
LlamaIndex's evaluation on a technical documentation corpus showed that 256-token chunks achieved the highest recall@5 (the retrieved chunks were more likely to contain the exact answer) but weak answer quality: the LLM struggled to generate complete answers from such short context. Chunks of 512 tokens achieved slightly lower recall@5 but substantially better answer quality, because each chunk contained enough context for the LLM to synthesize a useful response. Chunks of 1,024 tokens had the best answer quality on broad questions but the worst recall on specific questions, because the large chunks diluted the embedding signal.
# Approximate benchmark results from published evaluations
# Corpus: technical documentation, 10K documents
# Embedding: text-embedding-3-small
# Retrieval: top-5, cosine similarity
Chunk Size  | Recall@5 | Precision@5 | Answer Quality
------------|----------|-------------|----------------
128 tokens  | 0.72     | 0.35        | 4.2/10
256 tokens  | 0.81     | 0.42        | 6.1/10
400 tokens  | 0.79     | 0.44        | 7.4/10
512 tokens  | 0.77     | 0.43        | 7.8/10
768 tokens  | 0.73     | 0.38        | 7.5/10
1024 tokens | 0.68     | 0.32        | 7.2/10
# Answer quality judged by LLM-as-judge on completeness and accuracy

Why Smaller Is Not Always Better
Small chunks produce focused embeddings that match specific queries precisely. A 200-token chunk about database connection pooling produces an embedding that strongly matches any query about connection pooling. But when this chunk is included in an LLM's context window, it often lacks the surrounding information needed to generate a useful answer. The LLM may know that connection pooling is configured in a specific file but not how to configure it, because the configuration details were in an adjacent chunk.
The "small chunks, high recall" phenomenon is misleading because recall measures whether the right chunk was retrieved, not whether the LLM received enough information to answer the question. A system can have perfect recall (the chunk containing the answer is always in the top results) and still produce poor answers because the chunk is too short to be useful on its own.
Why Larger Is Not Always Better
Large chunks (800+ tokens) contain more context, which helps the LLM generate comprehensive answers. But the embedding for a large chunk is an average across everything in the chunk. If a 1,000-token chunk covers both database configuration and monitoring setup, the embedding sits somewhere between both topics. A query about database configuration matches this chunk less strongly than a 400-token chunk focused entirely on database configuration.
This topic dilution effect worsens as chunk size increases. At 2,000 tokens, most chunks cover multiple distinct topics, and the embedding is too general to strongly match any specific query. The result is that more relevant chunks appear lower in the ranking (pushed down by other chunks that happen to cover the query topic more centrally), reducing recall.
Content Type Affects Optimal Size
Technical documentation benefits from 400 to 600 token chunks. Documentation sections tend to cover one topic per section, and this chunk size aligns well with section lengths. Splitting at section boundaries (semantic chunking) often outperforms fixed-size chunking for documentation.
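For example, here is a minimal sketch of section-boundary chunking for Markdown-style documentation (the "#" heading convention and the whitespace token estimate are assumptions; use your embedding model's tokenizer in practice):

import re

def chunk_by_sections(markdown_text, max_tokens=600):
    # Split before each heading line so every section keeps its own heading.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    chunks, buffer, count = [], [], 0
    for section in sections:
        if not section.strip():
            continue
        size = len(section.split())  # rough token estimate
        # Merge small adjacent sections, but never let a chunk exceed the budget.
        if buffer and count + size > max_tokens:
            chunks.append("".join(buffer).strip())
            buffer, count = [], 0
        buffer.append(section)
        count += size
    if buffer:
        chunks.append("".join(buffer).strip())
    return chunks

Sections that are individually larger than the budget would still need a fixed-size fallback, which the practical recommendation at the end of this article covers.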
Support tickets and conversations work best with 200 to 400 token chunks. Individual messages and ticket updates are naturally short, and the relevant information is usually concentrated in a single message rather than spread across a long thread. Chunking at message boundaries is often better than fixed-size chunking.
Long-form content (reports, research papers, articles) benefits from 600 to 1,000 token chunks. The topics in long-form content develop gradually with paragraphs building on each other, so larger chunks preserve the narrative flow that the LLM needs to generate accurate summaries and explanations.
Code should be chunked at function or class boundaries rather than token counts. A 50-line function is a complete semantic unit regardless of token count, and splitting it mid-function destroys the embedding quality. If functions are very long, chunk at logical blocks (setup, processing, cleanup) within the function.
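A minimal sketch for Python source using the standard ast module (other languages would need a real parser such as tree-sitter; decorator handling and the grouping of loose top-level statements are simplifications):

import ast

def chunk_python_source(source):
    # Each top-level function or class becomes one chunk; remaining top-level
    # statements (imports, constants) are grouped into a single chunk.
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, leftovers = [], []
    for node in tree.body:
        segment = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(segment)
        else:
            leftovers.append(segment)
    if leftovers:
        chunks.insert(0, "\n".join(leftovers))
    return chunks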
Query Pattern Affects Optimal Size
Specific factual queries ("what port does Redis use") are best served by small chunks (200 to 400 tokens) because the answer is a specific fact that is concentrated in a small region of text. The embedding of a small chunk focused on Redis configuration strongly matches this query.
Explanatory queries ("how does the authentication system work") are best served by larger chunks (600 to 1,000 tokens) because the answer requires multiple paragraphs of context. Returning five 200-token chunks about authentication gives the LLM five disconnected fragments instead of a coherent explanation.
Mixed query patterns (most production applications) benefit from parent-child chunking: index small chunks (200 to 300 tokens) for precise embedding matching, but return the parent chunk (800 to 1,000 tokens) as context to the LLM. This gives you the retrieval precision of small chunks and the answer quality of large chunks.
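A sketch of this pattern, assuming a naive token splitter and a hypothetical vector index with add/search methods (most RAG frameworks ship an equivalent "small-to-big" or parent-document retriever):

def split_tokens(text, max_tokens):
    # Naive whitespace splitter used as a stand-in for a real tokenizer.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def build_parent_child_index(documents, index):
    # Index small child chunks for precise matching, but remember which
    # large parent chunk each child came from.
    parents = {}
    for doc in documents:
        for parent in split_tokens(doc, max_tokens=1000):
            parent_id = len(parents)
            parents[parent_id] = parent
            for child in split_tokens(parent, max_tokens=250):
                index.add(text=child, metadata={"parent_id": parent_id})
    return parents

def retrieve_context(query, index, parents, k=5):
    # Match the query against the small chunks, then hand the LLM the parents.
    parent_ids = []
    for hit in index.search(query, top_k=k):
        pid = hit.metadata["parent_id"]
        if pid not in parent_ids:  # deduplicate siblings from the same parent
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]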
Practical Recommendation
Start with 512-token chunks, split at semantic boundaries (paragraphs, sections) where possible with a fixed-size fallback, and use a 50-token overlap. Measure recall@10 and answer quality on 50+ queries. If recall is low on specific queries, try smaller chunks. If answer quality is low on broad queries, try larger chunks or parent-child chunking. Most teams arrive at 400 to 600 tokens after one or two rounds of evaluation.
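A minimal sketch of that starting point, using tiktoken for token counting (cl100k_base is the encoding used by text-embedding-3-small; adjust the numbers after your own evaluation):

import tiktoken

def chunk_fixed_size(text, chunk_tokens=512, overlap_tokens=50):
    # Fixed-size token chunking with overlap. Prefer splitting at semantic
    # boundaries when the document structure allows; use this as the fallback.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks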
Adaptive Recall manages memory storage and retrieval without requiring you to design chunking pipelines. Store memories as natural language, and the system handles segmentation and retrieval.
Try It Free