Does Prompt Caching Actually Save Money?
The Math Behind the Savings
Prompt caching works by storing the model's processed internal state (the attention key-value cache) for a repeated prompt prefix. On subsequent calls with an identical prefix, the cached state is reused instead of being recomputed. Anthropic prices cache reads at 10% of the normal input rate, while cache creation carries a 25% premium on the first write.
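In the Anthropic Messages API, the cacheable prefix is marked with a `cache_control` breakpoint. A sketch of a request body under these assumptions (the model name and prompt text are placeholders; the `cache_control` field itself is the real API mechanism):

```python
# Sketch of an Anthropic Messages API request body with a cache
# breakpoint on the system prompt. The cache_control marker tells the
# API to cache everything up to and including that block; content
# after it (the per-call user message) stays uncached.
LONG_SYSTEM_PROMPT = "You are a support agent..."  # ~5,000-token prefix in practice

request_body = {
    "model": "claude-sonnet-4-5",  # placeholder: any cache-capable model
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # marks the cache breakpoint
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

Everything before the breakpoint must be byte-identical across calls for the cache to hit; the user message after it can vary freely.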
Here is the actual cost calculation for a realistic application:
| Parameter | Value |
|---|---|
| Cacheable prefix (system prompt + tools) | 5,000 tokens |
| API calls per day | 10,000 |
| Cache TTL | 5 minutes (refreshed on hit) |
| Cache creations per day | ~288 (one per 5-minute window; a conservative upper bound, since refresh-on-hit can keep the cache alive longer) |
| Cache hits per day | ~9,712 |
Without caching: 10,000 calls x 5,000 tokens x $3.00/M (Sonnet's base input rate) = $150/day = $4,500/month
With caching:
- Cache writes: 288 calls x 5,000 tokens x $3.75/M = $5.40/day
- Cache reads: 9,712 calls x 5,000 tokens x $0.30/M = $14.57/day
- Total: $19.97/day = $599/month
Monthly savings: $3,901 (87% reduction on cached tokens)
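The arithmetic above can be reproduced with a short script. The rates are the published multipliers (reads at 10% of the base input rate, writes at a 25% premium); the volumes are the table's assumptions:

```python
# Cost comparison for prompt caching, using the table's assumptions.
BASE_RATE = 3.00 / 1_000_000   # $ per input token (Sonnet base rate)
WRITE_RATE = BASE_RATE * 1.25  # cache creation: 25% premium
READ_RATE = BASE_RATE * 0.10   # cache read: 90% discount

PREFIX_TOKENS = 5_000
CALLS_PER_DAY = 10_000
CREATIONS_PER_DAY = 24 * 60 // 5                   # one per 5-min window = 288
HITS_PER_DAY = CALLS_PER_DAY - CREATIONS_PER_DAY   # 9,712

uncached = CALLS_PER_DAY * PREFIX_TOKENS * BASE_RATE          # $150.00/day
cached = (CREATIONS_PER_DAY * PREFIX_TOKENS * WRITE_RATE      # $5.40/day
          + HITS_PER_DAY * PREFIX_TOKENS * READ_RATE)         # $14.57/day

print(f"uncached: ${uncached * 30:,.0f}/month")   # uncached: $4,500/month
print(f"cached:   ${cached * 30:,.0f}/month")     # cached:   $599/month
print(f"savings:  {1 - cached / uncached:.0%}")   # savings:  87%
```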
When Caching Saves the Most
Caching is most effective when your cacheable prefix is large and your call volume is high. The larger the prefix, the more tokens benefit from the 90% discount. The higher the volume, the more cache hits you get per cache creation, amortizing the 25% write premium over more reads.
Applications that benefit most include high-volume chatbots with a detailed system prompt, RAG applications with static reference documentation in the prompt, customer support systems with extensive tool definitions, and any application where the same instructions appear in thousands of daily calls.
When Caching Helps Less
Caching provides minimal savings in three scenarios. First, when the cacheable prefix is short (under 1,024 tokens for Sonnet/Opus, under 2,048 for Haiku), the cache minimum is not met and no caching occurs. Second, when the prefix changes frequently (every few calls), most calls pay the cache creation premium rather than the cache read discount. Third, when call volume is very low (under 100 per day), there are not enough cache hits per TTL window to amortize the creation cost.
Even in a near-worst case, caching rarely costs more than normal processing. The cache creation premium is 25% above the normal input rate, but it applies only to the first call in each TTL window, while each cache hit saves 90%. A single additional call that hits the cache within the 5-minute window therefore more than offsets the creation premium. Only in the pathological case where no call ever hits the cache does caching cost more, and then only by the 25% premium on the prefix tokens.
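The break-even point can be made precise. With one cache write and h hits per TTL window, cached cost is (1.25 + 0.10h) times the per-call prefix cost, versus (1 + h) for the same traffic uncached:

```python
# Break-even analysis: cached vs uncached cost for the same traffic,
# as a function of cache hits per cache creation.
def cost_ratio(hits_per_creation: float) -> float:
    """Cached cost divided by uncached cost, in units of prefix cost."""
    cached = 1.25 + 0.10 * hits_per_creation   # one write + h reads
    uncached = 1.0 + hits_per_creation         # 1 + h full-price calls
    return cached / uncached

# Zero hits: every window pays only the 25% write premium.
print(cost_ratio(0))   # 1.25 -- caching costs 25% more
# One hit per window already wins decisively.
print(cost_ratio(1))   # 0.675
# Break-even: 1.25 + 0.10h = 1 + h  =>  h = 0.25 / 0.90 ~ 0.28 hits/window
```

In other words, caching pays for itself with roughly one hit every three to four TTL windows; anything above that is pure savings.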
How to Measure Your Actual Savings
After enabling prompt caching, monitor the usage fields in API responses. The cache_read_input_tokens field shows how many tokens were served from cache. The cache_creation_input_tokens field shows how many tokens were processed fresh and cached. Calculate your cache hit rate (reads divided by total cached tokens) and compare your actual costs against your pre-caching baseline.
A healthy cache hit rate is above 95% for steady-traffic applications. If your hit rate is below 90%, check for dynamic content in your cached prefix (timestamps, session IDs, variable instructions) that is causing cache misses. Moving any dynamic content after the cache breakpoint immediately improves the hit rate.
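A minimal sketch of this monitoring, assuming you collect the usage objects from Messages API responses (the field names are the API's real ones; the sample data here is illustrative):

```python
# Aggregate cache statistics from a batch of API response usage objects.
# Field names match Anthropic's Messages API usage object; the
# responses list is illustrative stand-in data.
responses = [
    {"cache_creation_input_tokens": 5000, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 5000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 5000},
]

reads = sum(r["cache_read_input_tokens"] for r in responses)
creations = sum(r["cache_creation_input_tokens"] for r in responses)
hit_rate = reads / (reads + creations)  # reads over total cached tokens

print(f"cache hit rate: {hit_rate:.1%}")  # cache hit rate: 66.7%
if hit_rate < 0.90:
    print("check for dynamic content (timestamps, IDs) in the cached prefix")
```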
Combining Caching with External Memory
Prompt caching reduces the cost of static content in the context window. External memory reduces the amount of dynamic content needed. Together, they minimize total token cost: the system prompt is cached at 90% discount, and knowledge is stored externally so it does not consume context tokens at all. Only the specific memories retrieved for each query count as uncached input tokens, and these are typically a small fraction of the total context.
Minimize token costs with prompt caching plus external memory. Adaptive Recall keeps your context lean and your API bill low.
Try It Free