How Many Tokens Can GPT-4 Handle at Once?
Token Limits Across GPT-4 Variants
| Model | Context Window (tokens) | Max Output (tokens) | Approximate Pages |
|---|---|---|---|
| GPT-4 (original) | 8,192 | 4,096 | ~12 pages |
| GPT-4-32k | 32,768 | 4,096 | ~50 pages |
| GPT-4 Turbo | 128,000 | 4,096 | ~200 pages |
| GPT-4o | 128,000 | 16,384 | ~200 pages |
| GPT-4o mini | 128,000 | 16,384 | ~200 pages |
What 128,000 Tokens Actually Fits
Tokens are not the same as words. In English, one token averages about 0.75 words. But the ratio varies by content type. Plain English prose tokenizes efficiently at about 1.3 tokens per word. Code is less efficient, averaging 1.5 to 2.0 tokens per word because variable names, syntax characters, and whitespace each consume tokens. JSON and XML are the least efficient, sometimes requiring 3 or more tokens per word of actual content because of structural characters.
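A quick way to check these ratios against your own content is to count tokens directly. The sketch below uses OpenAI's tiktoken library with the cl100k_base encoding (the one GPT-4 and GPT-4 Turbo use; GPT-4o uses o200k_base). The sample strings are illustrative, not from any particular dataset.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-4 Turbo;
# GPT-4o uses o200k_base instead.
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "prose": "The quick brown fox jumps over the lazy dog and keeps on running.",
    "code":  "def fetch_user(user_id: int) -> dict:\n    return db.query(User).get(user_id)",
    "json":  '{"user_id": 42, "name": "Ada", "roles": ["admin", "editor"]}',
}

for label, text in samples.items():
    tokens = enc.encode(text)
    words = len(text.split())
    print(f"{label}: {len(tokens)} tokens / {words} words "
          f"= {len(tokens) / words:.2f} tokens per word")
```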
In practical terms, 128,000 tokens holds approximately:
- 200 pages of English prose (a short novel)
- 120 pages of source code (several thousand lines)
- 80 pages of JSON/XML data
- A full request combining a system prompt (2k), conversation history (10k), retrieved documents (20k), and response headroom (4k), with room to spare as the conversation grows across many turns
How GPT-4 Compares
GPT-4o's 128k window is typical of current mainstream models. Claude's models offer 200,000 tokens, a roughly 56% larger window that matters for long-document tasks. Gemini 1.5 Pro offers 1,000,000 tokens, the largest commercially available window, though attention quality at that scale is still being studied. Open-source models like Llama 3.1 also support 128,000 tokens when properly hosted.
For most applications, the difference between 128k and 200k is not meaningful because best practices (curated context, sliding windows, external memory) keep actual token usage well below either limit. The difference matters only for tasks that genuinely require processing very long inputs in a single pass, like analyzing entire codebases or reviewing lengthy legal documents.
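One of those practices, a sliding window over conversation history, is straightforward to implement. The sketch below trims the oldest messages until what remains fits a token budget; the count_tokens helper and the 10,000-token budget are illustrative assumptions, not a prescribed recipe.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(message: dict) -> int:
    # Rough per-message count: role plus content, ignoring the few
    # extra tokens of per-message overhead the chat format adds.
    return len(enc.encode(message["role"])) + len(enc.encode(message["content"]))

def sliding_window(history: list[dict], budget: int = 10_000) -> list[dict]:
    """Keep the most recent messages that fit within `budget` tokens."""
    kept, used = [], 0
    for message in reversed(history):        # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))              # restore chronological order
```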
Practical Capacity vs Theoretical Maximum
The 128k maximum is shared between input and output. After reserving tokens for the response (up to 16,384), the practical input capacity is about 112,000 tokens. After accounting for the system prompt (typically 1,000 to 5,000 tokens), tool definitions (1,000 to 3,000 tokens), and a safety margin, the effective capacity for conversation history and retrieved context is roughly 100,000 tokens.
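The arithmetic behind those estimates is easy to make explicit. The figures below sit in the middle of the ranges above and are illustrative, not fixed requirements.

```python
CONTEXT_WINDOW = 128_000   # total window shared by input and output (GPT-4o / GPT-4 Turbo)
MAX_OUTPUT     = 16_384    # reserved for the response (GPT-4o)
SYSTEM_PROMPT  = 3_000     # typical range: 1,000 to 5,000
TOOL_DEFS      = 2_000     # typical range: 1,000 to 3,000
SAFETY_MARGIN  = 5_000     # headroom for counting error and growth

input_capacity = CONTEXT_WINDOW - MAX_OUTPUT                               # ~112,000
effective = input_capacity - SYSTEM_PROMPT - TOOL_DEFS - SAFETY_MARGIN     # ~100,000

print(f"Practical input capacity: {input_capacity:,} tokens")
print(f"Effective capacity for history + retrieved context: {effective:,} tokens")
# With the attention caveat discussed next, plan for roughly half of this.
```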
Even this effective capacity is misleading because model attention degrades on information in the middle of long contexts. Research suggests that GPT-4's effective attention covers about 40 to 50% of its context window reliably. For critical applications where accuracy matters, plan around 50,000 to 60,000 tokens of active, reliable context rather than 128,000.
Beyond the Token Limit
If your application needs to work with more information than 128,000 tokens, you have two options: use a model with a larger window (Gemini at 1M, Claude at 200k) or use external memory to store knowledge outside the context window. External memory is almost always the better choice because it scales independently of the model's window, keeps cost proportional to query complexity rather than knowledge base size, and avoids the attention quality issues of very long contexts.
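In outline, the external-memory pattern is: store and index knowledge outside the window, retrieve only what the current query needs, and place that in the prompt. The sketch below is a minimal illustration; vector_store and embed_fn stand in for whatever storage and embedding layer your application uses, and are not a specific library's API.

```python
def answer_with_external_memory(query: str, vector_store, embed_fn, client,
                                top_k: int = 5) -> str:
    """Retrieve only what this query needs, then prompt the model.

    `vector_store` and `embed_fn` are hypothetical placeholders for the
    storage and embedding layer; `client` is an OpenAI client instance.
    """
    # 1. Retrieve: pull the top-k most relevant chunks from external storage.
    relevant = vector_store.search(embed_fn(query), k=top_k)
    context = "\n\n".join(chunk.text for chunk in relevant)

    # 2. Prompt: the model sees only a few thousand tokens of context,
    #    no matter how large the knowledge base grows.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```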
Give GPT-4 or any model access to unlimited knowledge. Adaptive Recall stores memories externally and retrieves exactly what each query needs.
Try It Free