
How to Choose the Right Context Window Size

Choosing the right context window size is a tradeoff between capability, cost, and latency. A window that is too small forces aggressive compression that loses information. A window that is too large wastes money on tokens the model cannot effectively attend to. The optimal size depends on your specific workload, and finding it requires measuring your actual context needs against cost and latency constraints.

The Decision Framework

Context window selection is not just a model choice. It is an architectural decision that affects cost structure, response latency, conversation management strategy, and the complexity of your retrieval pipeline. Choosing a 128k-token model because "bigger is better" without understanding your actual context usage is like renting a warehouse to store a bookshelf. The capacity is there, but you are paying for space you do not use, and the extra space does not make the bookshelf easier to find.

Step-by-Step Selection Process

Step 1: Profile your context requirements.
Measure the token usage of every prompt component across a representative sample of requests. Calculate the average, 90th percentile, and maximum for each: system prompt, tool definitions, conversation history at various conversation lengths, retrieved context, and response length. Your minimum context requirement is the sum of the 90th-percentile values because you want the window to handle most requests without compression, not just the average.
def profile_context(sample_requests):
    # count_tokens is assumed to be your tokenizer helper (e.g. a tiktoken wrapper).
    profiles = {
        "system": [],
        "tools": [],
        "history": [],
        "retrieval": [],
        "response": [],
    }
    for req in sample_requests:
        profiles["system"].append(count_tokens(req["system"]))
        profiles["tools"].append(count_tokens(str(req.get("tools", ""))))
        profiles["history"].append(
            sum(count_tokens(m["content"]) for m in req["history"])
        )
        profiles["retrieval"].append(count_tokens(req.get("retrieval", "")))
        profiles["response"].append(count_tokens(req["response"]))

    # Report p50 / p90 / max for each prompt component.
    for key, values in profiles.items():
        ordered = sorted(values)
        p50 = ordered[len(ordered) // 2]
        p90 = ordered[int(len(ordered) * 0.9)]
        p_max = max(ordered)
        print(f"{key}: p50={p50}, p90={p90}, max={p_max}")

    # Minimum recommended window: the sum of the per-component p90 values.
    total_p90 = sum(
        sorted(v)[int(len(v) * 0.9)] for v in profiles.values()
    )
    print(f"\nMinimum recommended context: {total_p90} tokens")
Step 2: Calculate your cost ceiling.
Set a per-call budget and a monthly budget based on your business model. A consumer chatbot charging $20 per month cannot afford $0.50 per API call. A B2B product charging $500 per seat per month can afford more. Work backward from your budget to find the maximum tokens per call you can sustain.
Model               Input $/M tokens    Output $/M tokens    Cost per 10k input + 1k output
GPT-4o              $2.50               $10.00               $0.035
Claude Sonnet 4.6   $3.00               $15.00               $0.045
Claude Haiku 4.5    $0.80               $4.00                $0.012
Gemini 1.5 Pro      $1.25               $5.00                $0.018

If your budget allows $0.05 per call and you need 1,000 output tokens, you have roughly 12,000 to 16,000 input tokens to work with on the premium models above, and considerably more on the cheaper ones. This constraint is often tighter than the model's context window limit.
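To sanity-check the budget arithmetic, it can be scripted directly. The sketch below reuses the per-million-token prices from the table above; the model keys are illustrative labels, and you should substitute your provider's current rates.
# Work backward from a per-call budget to the maximum affordable input size.
# Prices are dollars per million tokens, taken from the table above; update as needed.
PRICES = {
    "gpt-4o":            {"input": 2.50, "output": 10.00},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4.5":  {"input": 0.80, "output": 4.00},
    "gemini-1.5-pro":    {"input": 1.25, "output": 5.00},
}

def max_input_tokens(model, budget_per_call, output_tokens):
    price = PRICES[model]
    output_cost = output_tokens * price["output"] / 1_000_000
    remaining = budget_per_call - output_cost
    if remaining <= 0:
        return 0  # the output alone already exceeds the budget
    return int(remaining / (price["input"] / 1_000_000))

for model in PRICES:
    print(model, max_input_tokens(model, budget_per_call=0.05, output_tokens=1_000))
# -> roughly 16000, 11667, 57500, and 36000 input tokens respectively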

Step 3: Test quality at different context sizes.
Run your evaluation benchmark with artificially limited context sizes. Start with the minimum viable context (system prompt plus current query only), then progressively add conversation history and retrieved context. Measure quality at each level to find the point of diminishing returns. Most applications see 80% of their quality improvement from the first 4,000 to 8,000 tokens of context. The remaining 20% comes from adding more, at linearly increasing cost.
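One way to run this sweep is sketched below. It assumes you already have a labeled evaluation set; truncate_context, call_model, and score_response are hypothetical stand-ins for your own trimming logic, model client, and scoring function.
# Sweep quality across several context budgets to find the point of diminishing returns.
# eval_set is your labeled evaluation data; the three helpers below are placeholders.
CONTEXT_BUDGETS = [2_000, 4_000, 8_000, 16_000, 32_000]

def evaluate_at_budget(eval_set, budget):
    scores = []
    for case in eval_set:
        prompt = truncate_context(case, max_tokens=budget)  # keep system + query, trim the rest
        answer = call_model(prompt)
        scores.append(score_response(answer, case["expected"]))
    return sum(scores) / len(scores)

for budget in CONTEXT_BUDGETS:
    print(f"{budget} tokens -> avg quality {evaluate_at_budget(eval_set, budget):.3f}")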
Step 4: Factor in latency requirements.
Time to first token increases roughly linearly with input token count. A 100k-token prompt takes about 3 to 5 times longer to start generating than a 20k-token prompt. For interactive applications where users expect near-instant responses, this latency matters. Measure your model's time-to-first-token at your expected input sizes and verify it meets your responsiveness target.
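Measuring this is straightforward with any streaming client. The helper below treats the streaming call as an opaque iterator, since the exact API varies by provider; stream_fn is a placeholder for whatever streaming call you use.
import time

def measure_ttft(prompt, stream_fn):
    # stream_fn is a placeholder for your provider's streaming call:
    # anything that returns an iterator yielding chunks as they arrive.
    start = time.perf_counter()
    chunks = stream_fn(prompt)
    next(iter(chunks))  # block until the first chunk arrives
    return time.perf_counter() - start

# Compare TTFT at your expected sizes, e.g. a 20k-token vs. a 100k-token prompt,
# and check both against your responsiveness target.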
Step 5: Plan for growth.
Context requirements grow with your application. More users mean longer conversations. More features mean more tool definitions. More content means more retrieval results. Estimate how each component will grow over the next 6 to 12 months and verify that your chosen context size can accommodate that growth, or that your management strategy (compression, summarization, external memory) can scale to handle it.
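A rough headroom check can be as simple as multiplying each component's current p90 by an assumed growth factor; every number in the sketch below is illustrative.
# Project each component's p90 token count forward by an assumed 12-month growth factor.
# All figures here are illustrative; substitute your own measurements and forecasts.
current_p90 = {"system": 800, "tools": 1_500, "history": 6_000, "retrieval": 4_000, "response": 1_000}
growth_12mo = {"system": 1.1, "tools": 1.5, "history": 1.8, "retrieval": 1.4, "response": 1.0}

projected = {k: int(v * growth_12mo[k]) for k, v in current_p90.items()}
print("projected p90 total:", sum(projected.values()))
# Compare the total against your chosen window, leaving headroom for the response.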
Step 6: Consider hybrid architectures.
Often the most cost-effective solution is not a single large model but a combination of approaches. Use a smaller, faster model for simple queries that need minimal context. Route complex queries to a larger model with more context. Store persistent knowledge in external memory so it does not consume context on every call. This hybrid approach can deliver the same quality as a single large model at 30 to 50% of the cost.
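A minimal router might look like the sketch below; the token threshold, the classify_query heuristic, and the model names are assumptions to tune for your workload.
# Route cheap, low-context queries to a small model and everything else to a large one.
# count_tokens and classify_query are hypothetical helpers; the threshold is illustrative.
def route(request):
    context_tokens = (
        count_tokens(request["system"])
        + count_tokens(request["query"])
        + sum(count_tokens(m["content"]) for m in request["history"])
    )
    if context_tokens < 6_000 and classify_query(request["query"]) == "simple":
        return "small-fast-model"    # e.g. a Haiku-class model
    return "large-context-model"     # e.g. a 128k+ context model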

Sizing Guidelines by Use Case

Use Case                        Typical Context Needs   Recommended Approach
Simple Q&A chatbot              4k to 8k tokens         16k model, basic sliding window
Customer support with history   8k to 16k tokens        32k model, external memory for customer data
RAG over documents              10k to 30k tokens       128k model, chunk retrieval, top-5 selection
Coding assistant                20k to 60k tokens       128k+ model, external memory for codebase knowledge
Multi-document analysis         50k to 200k tokens      200k model or map-reduce pattern

These are starting points. Your actual needs depend on your specific content, conversation patterns, and quality requirements. Profile your usage (step 1) to get precise numbers.

Adaptive Recall lets you use smaller context windows without losing knowledge. Store persistent context externally and retrieve only what each query needs.
