How to Handle Context Window Overflow
Why Overflow Handling Matters
Without explicit overflow handling, each LLM provider deals with an oversized request in its own way, and none of them deals with it well. Some providers silently truncate the oldest messages, which can remove your system prompt and cause the model to ignore critical instructions. Others return a 400 error, which crashes your application. A few attempt to truncate from the middle, which corrupts multi-turn conversation flow. In every case, the user experience degrades in ways that are difficult to diagnose, because there is no obvious error message, just an AI that suddenly seems confused or forgetful.
Proactive overflow handling means your application never sends a request that exceeds the context window. Instead, it monitors token usage continuously and applies reduction strategies before the limit is reached. This approach turns overflow from a runtime error into a design constraint that is managed by architecture rather than by luck.
Step-by-Step Implementation
Install the tokenizer that matches your model. OpenAI models use tiktoken, Anthropic models use their own tokenizer, and open-source models typically use the Hugging Face tokenizers library. Count tokens for every component of your prompt separately: system prompt, tool definitions, conversation history, retrieved context, and any static instructions. Counting each section independently lets you know exactly where your tokens are being spent.
```python
import tiktoken

def count_tokens(text, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def count_message_tokens(messages, model="gpt-4o"):
    total = 0
    for msg in messages:
        total += count_tokens(msg["content"], model)
        total += 4  # overhead per message (role, formatting)
    total += 2  # priming tokens
    return total
```

Reserve 80% of the model's context window for input, leaving 20% for the response and a safety margin. For a 128k-token model, that means your input budget is roughly 102,000 tokens. For a 16k-token model, your budget is about 12,800 tokens. The safety margin accounts for tokenization differences between your local counter and the provider's counter, which can vary by 1 to 3% depending on the content.
```python
MODEL_LIMITS = {
    "gpt-4o": 128000,
    "claude-sonnet-4-6": 200000,
    "claude-haiku-4-5": 200000,
}

def get_input_budget(model, response_reserve=0.20):
    limit = MODEL_LIMITS.get(model, 16000)
    return int(limit * (1 - response_reserve))
```

Rank every section of your prompt by importance. The system prompt is highest priority because losing it changes model behavior entirely. Tool definitions are next because they enable functionality. The most recent user message is high priority because the model needs it to generate a relevant response. Retrieved context and older conversation history are lower priority because they add value but the model can function without them.
A typical priority ordering:
| Priority | Section | Reduction Strategy |
|---|---|---|
| 1 (highest) | System prompt | Never reduce |
| 2 | Tool definitions | Remove rarely used tools |
| 3 | Current user message | Never reduce |
| 4 | Recent history (last 3 turns) | Summarize if needed |
| 5 | Retrieved context | Reduce to top-k results |
| 6 | Older history | Summarize aggressively |
| 7 (lowest) | Examples and few-shot prompts | Remove entirely |
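In code, that table can be expressed as a list of section dicts that a reduction routine walks in priority order. The sketch below is illustrative only: the keys ("priority", "strategy", "content", plus a "name" field used for logging later) match the `fit_to_budget` example that follows, and the placeholder strings stand in for your real prompt components.

```python
# Illustrative sections list mirroring the priority table above.
# Placeholder strings stand in for real prompt components.
sections = [
    {"name": "system_prompt",     "priority": 1, "strategy": "never",     "content": "You are a support assistant..."},
    {"name": "tool_definitions",  "priority": 2, "strategy": "remove",    "content": '[{"name": "lookup_order", ...}]'},
    {"name": "current_message",   "priority": 3, "strategy": "never",     "content": "Why was my refund declined?"},
    {"name": "recent_history",    "priority": 4, "strategy": "summarize", "content": "...last three turns..."},
    {"name": "retrieved_context", "priority": 5, "strategy": "top_k",     "content": "...retrieved documents..."},
    {"name": "older_history",     "priority": 6, "strategy": "summarize", "content": "...older turns..."},
    {"name": "examples",          "priority": 7, "strategy": "remove",    "content": "...few-shot examples..."},
]
```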
When the total token count exceeds your budget, apply reduction strategies starting from the lowest priority section. Remove few-shot examples first. Then summarize older conversation history. Then reduce retrieved context from 10 results to 5. Then summarize recent history. Continue until the total fits within the budget. If you reach the system prompt and the total still exceeds the budget, your system prompt is too large and needs to be shortened at design time rather than at runtime.
```python
def fit_to_budget(sections, budget, model):
    total = sum(count_tokens(s["content"], model) for s in sections)
    if total <= budget:
        return sections
    # Reduce lowest-priority sections first (highest priority number = least important).
    # summarize() and keep_top_k() are your own helpers for condensing a section.
    for section in sorted(sections, key=lambda s: -s["priority"]):
        if total <= budget:
            break
        if section["strategy"] == "remove":
            total -= count_tokens(section["content"], model)
            section["content"] = ""
        elif section["strategy"] == "summarize":
            original_tokens = count_tokens(section["content"], model)
            section["content"] = summarize(section["content"])
            new_tokens = count_tokens(section["content"], model)
            total -= (original_tokens - new_tokens)
        elif section["strategy"] == "top_k":
            original_tokens = count_tokens(section["content"], model)
            section["content"] = keep_top_k(section["content"], k=3)
            new_tokens = count_tokens(section["content"], model)
            total -= (original_tokens - new_tokens)
    return [s for s in sections if s["content"]]
```

Log every instance where reduction was applied, including which sections were affected and how many tokens were removed. This data tells you how often your application is under context pressure, which sections consume the most space, and whether your token budget allocations need adjustment. If overflow reduction triggers on more than 10% of calls, you either need a larger model, a smaller system prompt, or an external memory system to offload persistent context.
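One lightweight way to capture that data is a thin wrapper around `fit_to_budget` that records per-section token deltas. The sketch below assumes each section dict carries a "name" field, as in the earlier example, and uses Python's standard logging module; adapt the field names and log format to your own stack.

```python
import logging

logger = logging.getLogger("context_budget")

def fit_to_budget_logged(sections, budget, model):
    # Snapshot per-section token counts before reduction so deltas can be reported.
    before = {s["name"]: count_tokens(s["content"], model) for s in sections}
    reduced = fit_to_budget(sections, budget, model)
    after = {s["name"]: count_tokens(s["content"], model) for s in reduced}
    for name, original in before.items():
        remaining = after.get(name, 0)  # sections removed entirely drop to zero
        if remaining < original:
            logger.warning(
                "context reduction: section=%s tokens_removed=%d tokens_remaining=%d budget=%d",
                name, original - remaining, remaining, budget,
            )
    return reduced
```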
Simulate conversations with 50, 100, and 200 turns to verify that your overflow handling degrades gracefully at every length. Check that the system prompt survives at maximum conversation length. Verify that recent context is still available after aggressive summarization. Confirm that the model's response quality remains acceptable when operating under maximum context pressure.
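A simple way to run those checks is a parametrized test that builds synthetic conversations and asserts the invariants above. This is a sketch only: it reuses the `count_tokens`, `get_input_budget`, and `fit_to_budget` examples from earlier, and `make_turn`, the stand-in `summarize`, and the placeholder prompts are hypothetical fixtures, not part of any real API.

```python
def summarize(text, max_chars=2000):
    # Stand-in so the sketch runs on its own; in production this would be
    # your real summarization step (e.g. a cheap LLM pass).
    return text[:max_chars]

def make_turn(i):
    # Synthetic conversation turn, padded so long histories exceed the budget.
    return f"user: question {i}\nassistant: answer {i} " + "detail " * 800

def test_graceful_degradation(model="gpt-4o"):
    budget = get_input_budget(model)
    for n_turns in (50, 100, 200):
        sections = [
            {"name": "system_prompt", "priority": 1, "strategy": "never",
             "content": "You are a support assistant..."},
            {"name": "older_history", "priority": 6, "strategy": "summarize",
             "content": "\n".join(make_turn(i) for i in range(n_turns))},
            {"name": "current_message", "priority": 3, "strategy": "never",
             "content": "What did we decide about the refund policy?"},
        ]
        reduced = fit_to_budget(sections, budget, model)
        names = {s["name"] for s in reduced}
        total = sum(count_tokens(s["content"], model) for s in reduced)
        assert "system_prompt" in names, f"system prompt dropped at {n_turns} turns"
        assert "current_message" in names, f"current message dropped at {n_turns} turns"
        assert total <= budget, f"over budget at {n_turns} turns: {total} > {budget}"
```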
The External Memory Alternative
Overflow handling manages a scarce resource. External memory eliminates the scarcity. When persistent knowledge lives in a memory system outside the context window, the only things in the context are the system prompt, the current query, and the few specific memories relevant to that query. Conversations can run indefinitely without overflow because old messages are stored and retrieved on demand rather than accumulated in the prompt.
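In outline, the per-request prompt under that architecture stays small regardless of conversation length. The sketch below is purely illustrative and does not reflect any particular product's API: `memory_store.search` is a hypothetical stand-in for whatever retrieval interface your memory system exposes.

```python
def build_prompt(memory_store, system_prompt, user_query, k=5):
    # Hypothetical external-memory flow: only the system prompt, the current
    # query, and a handful of relevant memories ever enter the context window.
    memories = memory_store.search(user_query, top_k=k)  # hypothetical interface
    context = "\n".join(m["text"] for m in memories)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Relevant memories:\n{context}\n\n{user_query}"},
    ]
```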
Adaptive Recall's MCP integration makes this transparent to the LLM. The model calls the recall tool to retrieve relevant context for each query and uses it to generate a response; that context then drops out of the window once the response is complete. The system prompt stays constant, the conversation history stays short, and the model always operates well within its context budget.
Stop managing overflow. Move persistent knowledge to external memory where conversations never hit the limit.
Try It Free