Does Prompt Caching Actually Save Money?
The Math Behind the Savings
Prompt caching works by storing the model's processed internal state (the attention key-value cache) for a repeated prompt prefix. On subsequent calls with an identical prefix, the cached state is reused instead of being recomputed. Anthropic prices cache reads at 10% of the normal input rate, while cache creation carries a 25% premium on the first write.
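In the Anthropic Messages API, the cacheable prefix is marked with a `cache_control` breakpoint. A sketch of a request body under these assumptions (the model name and prompt text are placeholders; the `cache_control` field itself is the real API mechanism):

```python
# Sketch of an Anthropic Messages API request body with a cache
# breakpoint on the system prompt. The cache_control marker tells the
# API to cache everything up to and including that block; content
# after it (the per-call user message) stays uncached.
LONG_SYSTEM_PROMPT = "You are a support agent..."  # ~5,000-token prefix in practice

request_body = {
    "model": "claude-sonnet-4-5",  # placeholder: any cache-capable model
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # marks the cache breakpoint
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

Everything before the breakpoint must be byte-identical across calls for the cache to hit; the user message after it can vary freely.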
Here is the actual cost calculation for a realistic application:
| Parameter | Value |
|---|---|
| Cacheable prefix (system prompt + tools) | 5,000 tokens |
| API calls per day | 10,000 |
| Cache TTL | 5 minutes (refreshed on hit) |
| Cache creations per day | ~288 (one per 5-minute window; a conservative upper bound, since refresh-on-hit can keep the cache alive longer) |
| Cache hits per day | ~9,712 |
Without caching: 10,000 calls x 5,000 tokens x $3.00/M (Sonnet's base input rate) = $150/day = $4,500/month
With caching:
- Cache writes: 288 calls x 5,000 tokens x $3.75/M = $5.40/day
- Cache reads: 9,712 calls x 5,000 tokens x $0.30/M = $14.57/day
- Total: $19.97/day = $599/month
Monthly savings: $3,901 (87% reduction on cached tokens)
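The arithmetic above can be reproduced with a short script. The rates are the published multipliers (reads at 10% of the base input rate, writes at a 25% premium); the volumes are the table's assumptions:

```python
# Cost comparison for prompt caching, using the table's assumptions.
BASE_RATE = 3.00 / 1_000_000   # $ per input token (Sonnet base rate)
WRITE_RATE = BASE_RATE * 1.25  # cache creation: 25% premium
READ_RATE = BASE_RATE * 0.10   # cache read: 90% discount

PREFIX_TOKENS = 5_000
CALLS_PER_DAY = 10_000
CREATIONS_PER_DAY = 24 * 60 // 5                   # one per 5-min window = 288
HITS_PER_DAY = CALLS_PER_DAY - CREATIONS_PER_DAY   # 9,712

uncached = CALLS_PER_DAY * PREFIX_TOKENS * BASE_RATE          # $150.00/day
cached = (CREATIONS_PER_DAY * PREFIX_TOKENS * WRITE_RATE      # $5.40/day
          + HITS_PER_DAY * PREFIX_TOKENS * READ_RATE)         # $14.57/day

print(f"uncached: ${uncached * 30:,.0f}/month")   # uncached: $4,500/month
print(f"cached:   ${cached * 30:,.0f}/month")     # cached:   $599/month
print(f"savings:  {1 - cached / uncached:.0%}")   # savings:  87%
```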
When Caching Saves the Most
Caching is most effective when your cacheable prefix is large and your call volume is high. The larger the prefix, the more tokens benefit from the 90% discount. The higher the volume, the more cache hits you get per cache creation, amortizing the 25% write premium over more reads.
Applications that benefit most include high-volume chatbots with a detailed system prompt, RAG applications with static reference documentation in the prompt, customer support systems with extensive tool definitions, and any application where the same instructions appear in thousands of daily calls.
When Caching Helps Less
Caching provides minimal savings in three scenarios. First, when the cacheable prefix is short (under 1,024 tokens for Sonnet/Opus, under 2,048 for Haiku), the cache minimum is not met and no caching occurs. Second, when the prefix changes frequently (every few calls), most calls pay the cache creation premium rather than the cache read discount. Third, when call volume is very low (under 100 per day), there are not enough cache hits per TTL window to amortize the creation cost.
Even in a near-worst case, caching rarely costs more than normal processing. The cache creation premium is 25% above the normal input rate, but it applies only to the first call in each TTL window, while each cache hit saves 90%. A single additional call that hits the cache within the 5-minute window therefore more than offsets the creation premium. Only in the pathological case where no call ever hits the cache does caching cost more, and then only by the 25% premium on the prefix tokens.
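The break-even point can be made precise. With one cache write and h hits per TTL window, cached cost is (1.25 + 0.10h) times the per-call prefix cost, versus (1 + h) for the same traffic uncached:

```python
# Break-even analysis: cached vs uncached cost for the same traffic,
# as a function of cache hits per cache creation.
def cost_ratio(hits_per_creation: float) -> float:
    """Cached cost divided by uncached cost, in units of prefix cost."""
    cached = 1.25 + 0.10 * hits_per_creation   # one write + h reads
    uncached = 1.0 + hits_per_creation         # 1 + h full-price calls
    return cached / uncached

# Zero hits: every window pays only the 25% write premium.
print(cost_ratio(0))   # 1.25 -- caching costs 25% more
# One hit per window already wins decisively.
print(cost_ratio(1))   # 0.675
# Break-even: 1.25 + 0.10h = 1 + h  =>  h = 0.25 / 0.90 ~ 0.28 hits/window
```

In other words, caching pays for itself with roughly one hit every three to four TTL windows; anything above that is pure savings.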
How to Measure Your Actual Savings
After enabling prompt caching, monitor the usage fields in API responses. The cache_read_input_tokens field shows how many tokens were served from cache. The cache_creation_input_tokens field shows how many tokens were processed fresh and cached. Calculate your cache hit rate (reads divided by total cached tokens) and compare your actual costs against your pre-caching baseline.
A healthy cache hit rate is above 95% for steady-traffic applications. If your hit rate is below 90%, check for dynamic content in your cached prefix (timestamps, session IDs, variable instructions) that is causing cache misses. Moving any dynamic content after the cache breakpoint immediately improves the hit rate.
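A minimal sketch of this monitoring, assuming you collect the usage objects from Messages API responses (the field names are the API's real ones; the sample data here is illustrative):

```python
# Aggregate cache statistics from a batch of API response usage objects.
# Field names match Anthropic's Messages API usage object; the
# responses list is illustrative stand-in data.
responses = [
    {"cache_creation_input_tokens": 5000, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 5000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 5000},
]

reads = sum(r["cache_read_input_tokens"] for r in responses)
creations = sum(r["cache_creation_input_tokens"] for r in responses)
hit_rate = reads / (reads + creations)  # reads over total cached tokens

print(f"cache hit rate: {hit_rate:.1%}")  # cache hit rate: 66.7%
if hit_rate < 0.90:
    print("check for dynamic content (timestamps, IDs) in the cached prefix")
```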
Combining Caching with External Memory
Prompt caching reduces the cost of static content in the context window. External memory reduces the amount of dynamic content needed. Together, they minimize total token cost: the system prompt is cached at 90% discount, and knowledge is stored externally so it does not consume context tokens at all. Only the specific memories retrieved for each query count as uncached input tokens, and these are typically a small fraction of the total context.
Minimize token costs with prompt caching plus external memory. Adaptive Recall keeps your context lean and your API bill low.
Try It Free