How Many Tokens Can GPT-4 Handle at Once?
Token Limits Across GPT-4 Variants
| Model | Context Window (tokens) | Max Output (tokens) | Approximate Pages |
|---|---|---|---|
| GPT-4 (original) | 8,192 | 4,096 | ~12 pages |
| GPT-4-32k | 32,768 | 4,096 | ~50 pages |
| GPT-4 Turbo | 128,000 | 4,096 | ~200 pages |
| GPT-4o | 128,000 | 16,384 | ~200 pages |
| GPT-4o mini | 128,000 | 16,384 | ~200 pages |
What 128,000 Tokens Actually Fits
Tokens are not the same as words. In English, one token averages about 0.75 words. But the ratio varies by content type. Plain English prose tokenizes efficiently at about 1.3 tokens per word. Code is less efficient, averaging 1.5 to 2.0 tokens per word because variable names, syntax characters, and whitespace each consume tokens. JSON and XML are the least efficient, sometimes requiring 3 or more tokens per word of actual content because of structural characters.
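A quick way to check these ratios against your own content is to count tokens directly. The sketch below uses OpenAI's tiktoken library with the cl100k_base encoding (the one GPT-4 and GPT-4 Turbo use; GPT-4o uses o200k_base). The sample strings are illustrative, not from any particular dataset.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-4 Turbo;
# GPT-4o uses o200k_base instead.
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "prose": "The quick brown fox jumps over the lazy dog and keeps on running.",
    "code":  "def fetch_user(user_id: int) -> dict:\n    return db.query(User).get(user_id)",
    "json":  '{"user_id": 42, "name": "Ada", "roles": ["admin", "editor"]}',
}

for label, text in samples.items():
    tokens = enc.encode(text)
    words = len(text.split())
    print(f"{label}: {len(tokens)} tokens / {words} words "
          f"= {len(tokens) / words:.2f} tokens per word")
```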
In practical terms, 128,000 tokens holds approximately:
- 200 pages of English prose (a short novel)
- 120 pages of source code (several thousand lines)
- 80 pages of JSON/XML data
- A full request combining a system prompt (2k), conversation history (10k), retrieved documents (20k), and response headroom (4k), with room to spare as the conversation grows across many turns
How GPT-4 Compares
GPT-4o's 128k window is typical of current mainstream models. Claude's models offer 200,000 tokens, a roughly 56% larger window that matters for long-document tasks. Gemini 1.5 Pro offers 1,000,000 tokens, the largest commercially available window, though attention quality at that scale is still being studied. Open-source models like Llama 3.1 also support 128,000 tokens when properly hosted.
For most applications, the difference between 128k and 200k is not meaningful because best practices (curated context, sliding windows, external memory) keep actual token usage well below either limit. The difference matters only for tasks that genuinely require processing very long inputs in a single pass, like analyzing entire codebases or reviewing lengthy legal documents.
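One of those practices, a sliding window over conversation history, is straightforward to implement. The sketch below trims the oldest messages until what remains fits a token budget; the count_tokens helper and the 10,000-token budget are illustrative assumptions, not a prescribed recipe.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(message: dict) -> int:
    # Rough per-message count: role plus content, ignoring the few
    # extra tokens of per-message overhead the chat format adds.
    return len(enc.encode(message["role"])) + len(enc.encode(message["content"]))

def sliding_window(history: list[dict], budget: int = 10_000) -> list[dict]:
    """Keep the most recent messages that fit within `budget` tokens."""
    kept, used = [], 0
    for message in reversed(history):        # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))              # restore chronological order
```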
Practical Capacity vs Theoretical Maximum
The 128k maximum is shared between input and output. After reserving tokens for the response (up to 16,384), the practical input capacity is about 112,000 tokens. After accounting for the system prompt (typically 1,000 to 5,000 tokens), tool definitions (1,000 to 3,000 tokens), and a safety margin, the effective capacity for conversation history and retrieved context is roughly 100,000 tokens.
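The arithmetic behind those estimates is easy to make explicit. The figures below sit in the middle of the ranges above and are illustrative, not fixed requirements.

```python
CONTEXT_WINDOW = 128_000   # total window shared by input and output (GPT-4o / GPT-4 Turbo)
MAX_OUTPUT     = 16_384    # reserved for the response (GPT-4o)
SYSTEM_PROMPT  = 3_000     # typical range: 1,000 to 5,000
TOOL_DEFS      = 2_000     # typical range: 1,000 to 3,000
SAFETY_MARGIN  = 5_000     # headroom for counting error and growth

input_capacity = CONTEXT_WINDOW - MAX_OUTPUT                               # ~112,000
effective = input_capacity - SYSTEM_PROMPT - TOOL_DEFS - SAFETY_MARGIN     # ~100,000

print(f"Practical input capacity: {input_capacity:,} tokens")
print(f"Effective capacity for history + retrieved context: {effective:,} tokens")
# With the attention caveat discussed next, plan for roughly half of this.
```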
Even this effective capacity is misleading because model attention degrades on information in the middle of long contexts. Research suggests that GPT-4's effective attention covers about 40 to 50% of its context window reliably. For critical applications where accuracy matters, plan around 50,000 to 60,000 tokens of active, reliable context rather than 128,000.
Beyond the Token Limit
If your application needs to work with more information than 128,000 tokens, you have two options: use a model with a larger window (Gemini at 1M, Claude at 200k) or use external memory to store knowledge outside the context window. External memory is almost always the better choice because it scales independently of the model's window, keeps cost proportional to query complexity rather than knowledge base size, and avoids the attention quality issues of very long contexts.
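In outline, the external-memory pattern is: store and index knowledge outside the window, retrieve only what the current query needs, and place that in the prompt. The sketch below is a minimal illustration; vector_store and embed_fn stand in for whatever storage and embedding layer your application uses, and are not a specific library's API.

```python
def answer_with_external_memory(query: str, vector_store, embed_fn, client,
                                top_k: int = 5) -> str:
    """Retrieve only what this query needs, then prompt the model.

    `vector_store` and `embed_fn` are hypothetical placeholders for the
    storage and embedding layer; `client` is an OpenAI client instance.
    """
    # 1. Retrieve: pull the top-k most relevant chunks from external storage.
    relevant = vector_store.search(embed_fn(query), k=top_k)
    context = "\n\n".join(chunk.text for chunk in relevant)

    # 2. Prompt: the model sees only a few thousand tokens of context,
    #    no matter how large the knowledge base grows.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```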
Give GPT-4 or any model access to unlimited knowledge. Adaptive Recall stores memories externally and retrieves exactly what each query needs.
Try It Free