
How Much Does Each Extra Token Cost?

Token costs vary by provider and model. For Claude Sonnet 4.6, each input token costs $0.000003 ($3.00 per million) and each output token costs $0.000015 ($15.00 per million). For GPT-4o, input is $0.0000025 and output is $0.000010. Cached input tokens on Anthropic cost 90% less at $0.0000003 each. Output tokens are typically 3 to 5 times more expensive than input tokens, which makes controlling response length as important as controlling context size.

Current Token Pricing

| Model | Input ($/M tokens) | Output ($/M tokens) | Cached Input ($/M) |
|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | $1.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 |
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.08 |
| GPT-4o | $2.50 | $10.00 | $1.25 |
| GPT-4o mini | $0.15 | $0.60 | $0.075 |
| Gemini 1.5 Pro | $1.25 | $5.00 | N/A |
| Gemini 1.5 Flash | $0.075 | $0.30 | N/A |
Note: Prices as of early 2026. Check each provider's pricing page for current rates. Batch API pricing and volume discounts may reduce these costs further.
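These per-million rates translate into per-call costs with simple arithmetic. Here is a minimal sketch (the `PRICES` dict and `call_cost` helper are illustrative names, not part of any provider SDK; rates are hardcoded from the table above):

```python
# Per-million-token rates from the pricing table above (early 2026).
PRICES = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00, "cached": 0.30},
    "gpt-4o":            {"input": 2.50, "output": 10.00, "cached": 1.25},
    "claude-haiku-4.5":  {"input": 0.80, "output": 4.00,  "cached": 0.08},
}

def call_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Dollar cost of a single API call. Cached input tokens are billed
    at the cached rate instead of the full input rate."""
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * p["input"]
            + cached_tokens * p["cached"]
            + output_tokens * p["output"]) / 1_000_000

# 10,000 input tokens + 2,000 output tokens on Claude Sonnet 4.6:
print(call_cost("claude-sonnet-4.6", 10_000, 2_000))  # 0.06
```

Note that caching 8,000 of those 10,000 input tokens drops the input portion from $0.030 to $0.0084 per call, which is why prompt caching matters for any prefix that repeats across requests.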

What Tokens Cost in Context

Individual token costs are tiny, but they compound across calls. Here is what adding 10,000 extra tokens to your context costs at different call volumes:

| Model | Cost per 10k extra tokens | 1,000 calls/day | 10,000 calls/day | Monthly (10k calls/day) |
|---|---|---|---|---|
| Claude Sonnet 4.6 | $0.030 | $30/day | $300/day | $9,000 |
| GPT-4o | $0.025 | $25/day | $250/day | $7,500 |
| Claude Haiku 4.5 | $0.008 | $8/day | $80/day | $2,400 |
| GPT-4o mini | $0.0015 | $1.50/day | $15/day | $450 |

Adding 10,000 unnecessary tokens to every API call on Claude Sonnet at 10,000 calls per day costs $9,000 per month. That is the cost of a verbose system prompt that could be compressed, retrieval results that are not filtered, or conversation history that is not managed. Every token you eliminate from the context directly reduces this cost.
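The monthly figures are straight multiplication, which makes them easy to recompute for your own volumes. A small sketch (function name is mine):

```python
def extra_monthly_cost(input_price_per_m, extra_tokens, calls_per_day, days=30):
    """Monthly cost of carrying extra_tokens of unneeded input context
    on every call, at a given $/M input rate."""
    per_call = extra_tokens * input_price_per_m / 1_000_000
    return per_call * calls_per_day * days

# 10,000 extra tokens on Claude Sonnet 4.6 ($3.00/M input), 10,000 calls/day:
print(extra_monthly_cost(3.00, 10_000, 10_000))  # ~9000.0
```

Swapping in the GPT-4o rate ($2.50/M) reproduces the $7,500 figure from the table.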

Input vs Output: The Hidden Asymmetry

Output tokens cost 3 to 5 times more than input tokens on every major provider. A 2,000-token response on Claude Sonnet costs $0.030, which is the same as 10,000 input tokens. This asymmetry means that controlling response length is as cost-effective as reducing context size, but is often overlooked.

Practical techniques for controlling output cost include setting max_tokens to a reasonable limit for your use case, instructing the model to be concise for simple queries, and using structured output formats (JSON, bullet points) that are naturally shorter than prose. A query that can be answered in 200 tokens should not generate a 2,000-token response just because the model defaulted to verbose mode.
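Because output is billed at the higher rate, trimming responses compounds quickly at volume. A worked sketch at the Sonnet output rate from the table above (the helper is illustrative):

```python
OUTPUT_PRICE_PER_M = 15.00  # Claude Sonnet 4.6 output rate, $/M tokens

def output_savings(verbose_tokens, concise_tokens, calls_per_day, days=30):
    """Monthly savings from tightening average response length."""
    saved_per_call = (verbose_tokens - concise_tokens) * OUTPUT_PRICE_PER_M / 1_000_000
    return saved_per_call * calls_per_day * days

# Cutting a 2,000-token default response to 200 tokens, 10,000 calls/day:
print(output_savings(2_000, 200, 10_000))  # ~8100.0
```

That is $0.027 saved per call, comparable to stripping 9,000 input tokens from the context on the same model.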

The Compounding Effect

Token costs compound across multiple factors. A chatbot that starts with 5,000 tokens of context and grows to 50,000 tokens over a long conversation sees a 10x input cost increase per call by the end. If the response also grows longer as the conversation gets more complex (from 500 to 2,000 tokens), the total cost per call has increased 8x from the first message to the last: $0.0225 versus $0.18 at Claude Sonnet rates.
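The first-call and last-call costs work out as follows at the Sonnet rates from the pricing table:

```python
IN, OUT = 3.00 / 1e6, 15.00 / 1e6  # Claude Sonnet 4.6, $/token

first_call = 5_000 * IN + 500 * OUT     # short context, short response
last_call  = 50_000 * IN + 2_000 * OUT  # grown context, longer response

print(first_call, last_call, last_call / first_call)
# first: $0.0225, last: $0.18, ratio: 8.0
```

Note that the output term shields the ratio somewhat: input cost alone rose 10x, but because the cheaper first call already carried a relatively expensive response, the blended increase is 8x.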

This compounding is why context management is primarily a cost engineering discipline. The techniques described in other articles in this pillar (sliding windows, compression, prompt caching, external memory) are not just about fitting within the context limit. They are about keeping per-call costs constant regardless of conversation length, knowledge base size, or query complexity.

When External Memory Is Cheaper Than Tokens

External memory systems have their own costs: storage for memories, computation for retrieval, and API costs for the memory service itself. The question is whether these costs are less than the token costs they replace.

For an application that includes 20,000 tokens of persistent knowledge in every prompt, switching to external memory that retrieves 2,000 relevant tokens per query saves 18,000 tokens per call. On Claude Sonnet at 10,000 calls per day, that is $0.054 saved per call, or $16,200 per month. If the memory system costs $500 per month to operate, the net savings are $15,700 per month.

The break-even point depends on your token volume and the cost of the memory system, but for most applications making more than 1,000 calls per day with persistent knowledge in the context, external memory pays for itself many times over.
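The break-even arithmetic is simple enough to sketch directly (the function name is mine; the figures are the ones from the example above):

```python
def memory_break_even(tokens_saved_per_call, input_price_per_m,
                      memory_cost_per_month, calls_per_day, days=30):
    """Net monthly savings from moving persistent context into external
    memory. Positive means the memory system pays for itself."""
    token_savings = (tokens_saved_per_call * input_price_per_m / 1_000_000
                     * calls_per_day * days)
    return token_savings - memory_cost_per_month

# 18,000 tokens saved per call on Claude Sonnet 4.6, $500/month memory
# system, 10,000 calls/day:
print(memory_break_even(18_000, 3.00, 500, 10_000))  # ~15700.0
```

Run the same function at your own call volume: at 1,000 calls per day the token savings are $1,620 per month, still well above a $500 memory bill.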

Every token you remove from context saves money. Adaptive Recall stores knowledge externally so you pay only for the tokens each query actually needs.

Get Started Free