How to Manage Long Documents in LLM Apps
The Long Document Problem
A 50-page technical document contains roughly 25,000 words, which tokenizes to approximately 33,000 tokens. On a model with a 128k context window, the entire document fits with room for instructions and a response. But fitting and working well are different things. Research consistently shows that LLM accuracy on questions about long documents degrades when the relevant information is buried in the middle of the context. A 33,000-token document takes up significant attention budget, and irrelevant sections dilute the model's focus on the sections that actually answer the question.
Even worse, many real-world documents are much longer than 50 pages. Legal contracts run to 200 pages. Codebases span thousands of files. A research paper collection can run to millions of tokens. For these, fitting the entire document in context is not an option even with the largest available models. You need a strategy for extracting and presenting the relevant portions.
Step-by-Step Implementation
Calculate the token count of your document and compare it to your available context budget (model limit minus system prompt, conversation history, and response reserve). If the document fits within 40% of the model's context window, you can include it directly with acceptable quality. If it exceeds 40%, use chunking and retrieval. If it exceeds the full context window, chunking is mandatory.
import tiktoken

def assess_document(doc_text, model="gpt-4o"):
    # Count the document's tokens with the tokenizer for the target model
    enc = tiktoken.encoding_for_model(model)
    doc_tokens = len(enc.encode(doc_text))
    model_limit = 128000
    # Reserve 60% of the window for instructions, history, and the response
    budget = int(model_limit * 0.4)
    if doc_tokens <= budget:
        return "inline", doc_tokens
    elif doc_tokens <= model_limit:
        return "chunk_recommended", doc_tokens
    else:
        return "chunk_required", doc_tokens

The goal of chunking is to split the document into pieces that are semantically coherent (each chunk covers one topic or section) and appropriately sized (large enough to contain useful context but small enough to leave room in the context window for multiple chunks). Three approaches work well depending on document structure.
Header-based chunking splits on document headers (H1, H2, H3) so each chunk corresponds to a named section. This preserves the document's logical structure and works well for structured documents like technical documentation, research papers, and legal contracts. Each chunk gets the section header as metadata for context.
Recursive character splitting splits by paragraph first, then by sentence if paragraphs exceed the chunk size limit. This works for unstructured text like emails, transcripts, and free-form writing. The recursive approach ensures chunks break at natural boundaries rather than mid-sentence.
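Here is a minimal sketch of that recursive fallback: split on paragraphs first, then regroup sentences whenever a single paragraph exceeds the limit. The sentence-splitting regex is deliberately naive, and the 500-token default is just a starting point:

import re
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def recursive_split(text, max_tokens=500):
    def too_big(piece):
        return len(enc.encode(piece)) > max_tokens

    chunks = []
    for para in text.split("\n\n"):
        if not too_big(para):
            chunks.append(para)
            continue
        # Paragraph exceeds the limit: regroup its sentences instead
        current = ""
        for sent in re.split(r'(?<=[.!?])\s+', para):
            if current and too_big(current + " " + sent):
                chunks.append(current.strip())
                current = sent
            else:
                current = (current + " " + sent).strip()
        if current:
            chunks.append(current.strip())
    return chunks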
Semantic chunking uses embedding similarity to detect topic shifts and splits at points where the content changes topic. This produces the most coherent chunks but requires more computation because every sentence needs to be embedded. Use this for documents where the structure is unclear or where header-based splitting would produce chunks that are too large or too small.
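A rough sketch of the semantic approach, assuming a hypothetical embed() helper that returns a vector for a string; the 0.75 threshold is an arbitrary starting value you would tune per corpus:

import re
import numpy as np

def semantic_chunks(text, embed, threshold=0.75):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    vectors = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, curr, sent in zip(vectors, vectors[1:], sentences[1:]):
        # Cosine similarity between adjacent sentences
        sim = prev @ curr / (np.linalg.norm(prev) * np.linalg.norm(curr))
        if sim < threshold:
            # A similarity drop suggests a topic shift: close the chunk here
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks

For structured documents, the header-based approach described above can be implemented directly on markdown-style headers: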
import re
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text):
    return len(enc.encode(text))

def chunk_by_headers(text, max_tokens=500):
    # Split wherever a markdown header (#, ##, ###) starts a new line
    sections = re.split(r'\n(?=#{1,3}\s)', text)
    chunks = []
    for section in sections:
        tokens = count_tokens(section)
        if tokens <= max_tokens:
            chunks.append(section)
        else:
            # Section is too large: fall back to paragraph-level grouping
            paragraphs = section.split('\n\n')
            current = ""
            for para in paragraphs:
                if count_tokens(current + para) <= max_tokens:
                    current += para + "\n\n"
                else:
                    if current:
                        chunks.append(current.strip())
                    current = para + "\n\n"
            if current:
                chunks.append(current.strip())
    return chunks

Embed each chunk using your embedding model and store the embeddings alongside the chunk text and metadata (source document, section header, position in document). At query time, embed the query and retrieve the top 3 to 5 most similar chunks. Include the chunk metadata in the context so the model knows where the information came from.
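As a sketch of that retrieval step, again assuming a hypothetical embed() helper and a simple in-memory index (swap in a vector database for anything beyond a few thousand chunks):

import numpy as np

def build_index(chunks, metadata, embed):
    vectors = np.array([embed(c) for c in chunks], dtype=float)
    # Normalize so a dot product is cosine similarity
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return {"vectors": vectors, "chunks": chunks, "metadata": metadata}

def retrieve(query, index, embed, top_k=5):
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    scores = index["vectors"] @ q
    best = np.argsort(scores)[::-1][:top_k]
    # Return text plus metadata so the prompt can cite the source section
    return [(index["chunks"][i], index["metadata"][i]) for i in best]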
Chunk-level retrieval works well for specific questions ("What is the cancellation policy?") but poorly for overview questions ("Summarize this document"). For overview queries, build a summary hierarchy: summarize each section, then summarize the summaries into a document-level overview. Store these summaries alongside the raw chunks and route overview queries to the summary level.
def build_summary_hierarchy(chunks, sections):
    # summarize() stands in for an LLM call that condenses text to a word limit
    section_summaries = {}
    for section_name, section_chunks in sections.items():
        combined = "\n".join(section_chunks)
        section_summaries[section_name] = summarize(combined, max_words=200)
    doc_summary = summarize(
        "\n".join(section_summaries.values()),
        max_words=500
    )
    return {
        "document": doc_summary,
        "sections": section_summaries,
        "chunks": chunks
    }

When a document references "the requirements discussed in Section 3" from Section 7, the chunk for Section 7 is incomplete without Section 3. Add overlapping context by including the last 2 to 3 sentences of the previous chunk at the beginning of each chunk. Also store cross-reference metadata so the retrieval system can automatically include referenced sections when they are mentioned.
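A minimal way to add that overlap after chunking, with a deliberately naive sentence split:

def add_overlap(chunks, overlap_sentences=3):
    if not chunks:
        return []
    overlapped = [chunks[0]]
    for prev, curr in zip(chunks, chunks[1:]):
        # Carry the tail of the previous chunk into the start of the next one
        tail = " ".join(prev.split(". ")[-overlap_sentences:])
        overlapped.append(tail + "\n\n" + curr)
    return overlapped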
Tasks like "extract all action items from this meeting transcript" or "identify all risk factors in this contract" require processing every part of the document, not just relevant chunks. Use a map-reduce pattern: process each chunk independently (the map step), then combine the results (the reduce step). Each chunk gets its own LLM call with a focused instruction, and the outputs are merged into a final result.
def map_reduce_extract(chunks, instruction, model):
    # llm_call() stands in for whatever chat-completion wrapper you use
    # Map: process each chunk independently
    chunk_results = []
    for chunk in chunks:
        result = llm_call(
            system=f"Extract from this text section: {instruction}",
            user=chunk,
            model=model
        )
        chunk_results.append(result)
    # Reduce: combine results
    combined = "\n---\n".join(chunk_results)
    final = llm_call(
        system=f"Combine these extracted results. Remove duplicates. {instruction}",
        user=combined,
        model=model
    )
    return final

Storing Documents as Memories
For documents that will be queried repeatedly, storing them as structured memories in an external system is more efficient than re-chunking and re-embedding on every query. Adaptive Recall can store document chunks as individual memories with entity extraction, building a knowledge graph that connects concepts across chunks and across documents. Subsequent queries retrieve the relevant chunks through cognitive scoring, which considers not just similarity but also recency, access frequency, and entity connections.