# Context Window Sizes: Every Major Model Compared
## 2026 Context Window Comparison
| Model | Context Window | Approx. Pages of Text | Input $/M tokens | Output $/M tokens |
|---|---|---|---|---|
| GPT-4o | 128,000 | ~200 pages | $2.50 | $10.00 |
| GPT-4o mini | 128,000 | ~200 pages | $0.15 | $0.60 |
| Claude Opus 4 | 200,000 | ~300 pages | $15.00 | $75.00 |
| Claude Sonnet 4.6 | 200,000 | ~300 pages | $3.00 | $15.00 |
| Claude Haiku 4.5 | 200,000 | ~300 pages | $0.80 | $4.00 |
| Gemini 1.5 Pro | 1,000,000 | ~1,500 pages | $1.25 | $5.00 |
| Gemini 1.5 Flash | 1,000,000 | ~1,500 pages | $0.075 | $0.30 |
| Llama 3.1 (405B) | 128,000 | ~200 pages | Varies by host | Varies by host |
| Mistral Large | 128,000 | ~200 pages | $2.00 | $6.00 |
## What the Numbers Actually Mean
### Token Counts in Context
Raw token limits are abstract. Here is what common token counts translate to in practical terms (a token-counting sketch follows the list):
- 4,000 tokens: About 6 pages of text or a short article. Enough for a system prompt, a brief conversation, and a response. This was GPT-3.5's original limit.
- 16,000 tokens: About 24 pages. Fits a system prompt, a moderate conversation (10 to 15 turns), and some retrieved context. Adequate for most chatbot applications.
- 128,000 tokens: About 200 pages, roughly the length of a short novel. Can hold large documents, long conversations, and substantial retrieved context. The current standard for most production applications.
- 200,000 tokens: About 300 pages. Claude's window fits a technical book or several large documents simultaneously.
- 1,000,000 tokens: About 1,500 pages. Gemini's window can theoretically hold multiple books. In practice, attention quality at this scale is still an active area of research.
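When a budget actually matters, measure token counts rather than estimating from page counts. Here is a minimal counting sketch using OpenAI's tiktoken library; other providers tokenize differently and expose their own counters, so treat cross-model counts as estimates, and note the file name is a placeholder:

```python
# pip install tiktoken
import tiktoken

def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Count tokens using the encoding shared by GPT-4o-family models.

    Anthropic and Google tokenize differently, so for those models
    treat this as an estimate, not an exact budget.
    """
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

document = open("report.txt").read()   # placeholder input file
tokens = count_tokens(document)
print(f"{tokens} tokens ~ {tokens / 640:.0f} pages")  # ~640 tokens/page, per the table above
```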
### The Cost of Context
Context window size and cost are directly linked because processing more tokens requires more computation. The cost of filling a context window varies dramatically by model:
| Scenario | GPT-4o | Claude Sonnet 4.6 | Claude Haiku 4.5 | Gemini 1.5 Flash |
|---|---|---|---|---|
| 10k input + 1k output | $0.035 | $0.045 | $0.012 | $0.001 |
| 50k input + 2k output | $0.145 | $0.180 | $0.048 | $0.004 |
| 100k input + 4k output | $0.290 | $0.360 | $0.096 | $0.009 |
The difference between using 10k tokens and 100k tokens is roughly a factor of 8 in cost. For an application making 10,000 calls per day, the difference between a 10k-token and a 100k-token average context is about $3,150 per day for Claude Sonnet ($0.360 versus $0.045 per call, from the table above). This is why context management matters: every token you remove from the context directly reduces cost.
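The per-call figures in the table are straightforward to reproduce. A small helper with prices hardcoded from the comparison table above (update them if your provider's rates change):

```python
# Per-million-token (input, output) prices from the comparison table above.
PRICING = {
    "gpt-4o":            (2.50, 10.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5":  (0.80, 4.00),
    "gemini-1.5-flash":  (0.075, 0.30),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call: tokens / 1M * price per million."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Reproduces the table, e.g. Claude Sonnet at 100k in + 4k out:
print(call_cost("claude-sonnet-4.6", 100_000, 4_000))    # 0.36

# Daily delta for 10,000 calls, 100k vs 10k average context:
delta = (call_cost("claude-sonnet-4.6", 100_000, 4_000)
         - call_cost("claude-sonnet-4.6", 10_000, 1_000))
print(f"${delta * 10_000:,.0f} per day")                 # $3,150
```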
### Effective vs Advertised Window Size
The advertised context window is the theoretical maximum. The effective window, where the model reliably uses the information you provide, is smaller. Several factors reduce the effective window:
Lost in the middle: Information at the beginning and end of the context receives more attention than information in the middle. For contexts over 20,000 tokens, this effect becomes significant. Placing critical information in the middle of a long context can reduce the probability of it being used by 20% or more compared to placing it at the beginning or end; a reordering sketch that mitigates this follows below.
Attention dilution: As context length increases, the model's attention is spread across more tokens. Each individual token receives proportionally less attention. This means that a highly relevant 500-token document competes for attention with every other token in the context. In a 10k-token context, it gets roughly 5% of the attention budget. In a 100k-token context, it gets roughly 0.5%.
Instruction following degradation: Long contexts can reduce the model's adherence to system prompt instructions. The system prompt is at the beginning of the context, and as the context grows, the "distance" between the system prompt and the current query increases, weakening the system prompt's influence on the response.
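A common mitigation for the lost-in-the-middle effect is to reorder retrieved passages so the highest-scoring ones sit at the edges of the context and the weakest sink to the middle. A minimal sketch, assuming your retriever already returns passages sorted best-first:

```python
def edge_order(passages: list[str]) -> list[str]:
    """Reorder passages (best first) so the strongest land at the start
    and end of the context, and the weakest in the middle, where
    attention is weakest."""
    front, back = [], []
    for i, passage in enumerate(passages):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# Passages sorted best-to-worst by the retriever:
ordered = edge_order(["p1", "p2", "p3", "p4", "p5"])
# -> ["p1", "p3", "p5", "p4", "p2"]: best at the start,
#    second-best at the end, weakest in the middle
```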
A practical guideline is to plan around 40% of the advertised window as the effective size for retrieval and reasoning tasks. For a 128k model, plan around 50k usable tokens. For a 200k model, plan around 80k. This does not mean you cannot use more, but quality will be measurably better with curated context than with maximum context.
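The 40% guideline is easy to enforce in code before each call. A sketch using advertised sizes from the comparison table, with a GPT-4o tokenizer as a rough counter for all models (an approximation, since each provider tokenizes differently):

```python
import tiktoken

ADVERTISED = {"gpt-4o": 128_000, "claude-sonnet-4.6": 200_000}
EFFECTIVE_FRACTION = 0.4   # rule of thumb from the text, not a provider limit

def fits_effective_window(model: str, context: str) -> bool:
    """True if `context` fits within ~40% of the advertised window."""
    budget = int(ADVERTISED[model] * EFFECTIVE_FRACTION)
    tokens = len(tiktoken.get_encoding("o200k_base").encode(context))
    return tokens <= budget

# 128k advertised -> a ~51k usable budget, matching the guideline above
```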
## Model Selection by Use Case
Simple chatbots and Q&A: Claude Haiku 4.5 or GPT-4o mini offer the best price/performance ratio. Their 128k to 200k windows are more than adequate, and their per-token cost makes high-volume applications affordable. Use a 10k to 15k token working context with a sliding window for history.
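A sliding window over history is a few lines: pin the system prompt and keep only the newest turns that fit the working budget. A minimal sketch; the 12k budget is an assumption inside the 10k to 15k guideline, and `count` is any token-counting function, such as the tiktoken helper shown earlier:

```python
def sliding_window(system_prompt, history, count, budget=12_000):
    """Keep the system prompt plus the newest turns that fit the budget.

    `history` is a list of {"role": ..., "content": ...} dicts, oldest
    first; `count` maps a string to a token count.
    """
    kept, used = [], count(system_prompt)
    for turn in reversed(history):           # walk newest -> oldest
        cost = count(turn["content"])
        if used + cost > budget:
            break                            # oldest turns fall out of the window
        kept.append(turn)
        used += cost
    return list(reversed(kept))              # restore chronological order
```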
RAG applications: Claude Sonnet 4.6 or GPT-4o provide the reasoning quality needed for complex retrieval tasks. Use 20k to 40k tokens of context with curated retrieval rather than filling the full window. Enable prompt caching for the system prompt to reduce costs.
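With Anthropic's API, prompt caching is a cache_control marker on the static system prompt, so repeated calls reuse the processed prefix at a reduced input rate. A sketch: the model ID is illustrative (check the current model list), the prompt and question variables are placeholders, and caching only applies above a minimum prefix length:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",   # illustrative ID; verify against the model list
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STATIC_SYSTEM_PROMPT,       # placeholder: instructions, schemas, etc.
        "cache_control": {"type": "ephemeral"},  # cache this prefix across calls
    }],
    messages=[{"role": "user", "content": user_question}],  # placeholder query
)
```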
Long document processing: Gemini 1.5 Pro's 1M-token window is unique for tasks that genuinely require processing very long documents in a single pass. The cost per token is competitive, but processing a full 1M-token context is expensive in absolute terms ($1.25 per call for input alone). Consider whether chunking and map-reduce can achieve the same result at lower cost.
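A map-reduce pass splits the document, summarizes each chunk with small, cheap calls, then merges the partial summaries. A sketch of the pattern; `summarize` is a placeholder for your own model-call wrapper, not a library function:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def split_by_tokens(text: str, chunk_tokens: int) -> list[str]:
    """Split text into chunks of at most `chunk_tokens` tokens."""
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + chunk_tokens])
            for i in range(0, len(ids), chunk_tokens)]

def map_reduce_summary(document: str, chunk_tokens: int = 8_000) -> str:
    """Summarize a long document without a single huge-context call."""
    chunks = split_by_tokens(document, chunk_tokens)
    partials = [summarize(chunk) for chunk in chunks]   # map: one small call per chunk
    return summarize("Combine these partial summaries:\n\n"
                     + "\n\n".join(partials))           # reduce: merge the partials
```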
Complex reasoning and analysis: Claude Opus 4 provides the highest reasoning quality. Its 200k window and strong attention across long contexts make it suitable for tasks where accuracy matters more than speed or cost. The high per-token cost means it should be reserved for high-value tasks rather than routine queries.
## The External Memory Alternative
For applications where the knowledge base exceeds any context window, external memory is the architecture to adopt. Instead of choosing a larger model with a bigger window, store knowledge in a persistent memory system and retrieve the relevant subset for each query. This approach works with any model size, keeps costs proportional to query complexity rather than knowledge base size, and eliminates the attention quality issues that come with very long contexts.
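The pattern in code: embed the query, pull a small top-k subset from the persistent store, and build a per-query context. A generic sketch in which embed, store.search, and chat are placeholders for whatever embedding model, vector store, and LLM client you use, not a specific product API:

```python
def answer_with_external_memory(query: str, store, k: int = 5) -> str:
    """Context stays proportional to the query, not the knowledge base."""
    hits = store.search(embed(query), top_k=k)   # retrieve a small relevant subset
    context = "\n\n".join(hit.text for hit in hits)
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {query}")
    return chat(prompt)                          # any model, any window size
```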
Use the right model for your task, not the biggest one. Adaptive Recall provides the knowledge layer so your LLM only needs context for the current query.
Get Started Free