
How to Implement Prompt Caching with Anthropic

Prompt caching reduces the cost of repeated prompt prefixes by up to 90% and lowers latency for cached tokens. When your system prompt, tool definitions, and static instructions are identical across multiple API calls, Anthropic caches the processed attention state so it does not need to be recomputed. This guide shows you how to structure your prompts for caching, add cache control markers, and measure the actual savings.

How Prompt Caching Works

When an LLM processes input tokens, it computes key-value (KV) attention matrices that represent how each token relates to every other token. This computation is the most expensive part of inference. Prompt caching stores these KV matrices for the prefix portion of a prompt so that subsequent requests with the same prefix skip the computation entirely.

Anthropic's implementation works at the content block level. You mark specific content blocks with a cache_control parameter, and the API caches everything up to and including that block. On subsequent requests, if the prompt prefix matches the cached version exactly (byte-for-byte), the cached KV matrices are reused. The cache has a 5-minute TTL that refreshes on each cache hit, so active applications maintain their cache indefinitely while idle applications let it expire naturally.
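The refresh-on-read behavior is worth internalizing, because it means steady traffic never pays the write cost twice. The toy model below is illustrative only, not Anthropic's implementation: every lookup, hit or miss, pushes the entry's expiry five minutes out.

```python
# Toy model of the refresh-on-read TTL described above (illustrative only,
# not Anthropic's implementation): every lookup, hit or miss, resets the
# entry's 5-minute expiry, so steady traffic keeps the cache warm.
class EphemeralCache:
    TTL = 300  # seconds

    def __init__(self):
        self.expiries = {}  # prefix -> expiry timestamp

    def lookup(self, prefix: str, now: float) -> bool:
        expiry = self.expiries.get(prefix)
        hit = expiry is not None and now < expiry
        self.expiries[prefix] = now + self.TTL  # write or refresh
        return hit


cache = EphemeralCache()
print(cache.lookup("static prefix", now=0))    # False: first request writes the cache
print(cache.lookup("static prefix", now=250))  # True: hit, expiry pushed to t=550
print(cache.lookup("static prefix", now=450))  # True: warm only because of the refresh
```

Without the refresh, the entry written at t=0 would have expired at t=300 and the third lookup would miss.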

The economics are significant. Cached input tokens are priced at 10% of the normal input token rate. For Claude Sonnet 4.6, that means cached input costs $0.30 per million tokens instead of $3.00 per million. For an application that includes a 5,000-token system prompt in every call and makes 10,000 calls per day, prompt caching saves approximately $135 per day on the system prompt alone.
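The daily figure above can be sanity-checked with quick arithmetic, ignoring the one-time cache-write premium (covered in Step 5):

```python
# 5,000-token system prompt, 10,000 calls/day, Claude Sonnet pricing
tokens = 5_000
calls_per_day = 10_000

cost_without = tokens / 1_000_000 * 3.00 * calls_per_day  # $150.00/day
cost_with = tokens / 1_000_000 * 0.30 * calls_per_day     # $15.00/day

print(cost_without - cost_with)  # 135.0 -> ~$135/day saved
```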

Step-by-Step Implementation

Step 1: Understand what is cacheable.
Any content block that is identical across multiple requests can be cached. This includes system prompts, tool definitions, static few-shot examples, reference documentation, and any other content that does not change between calls. Content that changes on every call, such as conversation history, the current user message, and dynamically retrieved context, cannot be cached.

The minimum cacheable prefix length is 1,024 tokens for Claude Sonnet and Opus, and 2,048 tokens for Claude Haiku. Shorter prefixes are not cached because the overhead of cache management exceeds the savings. If your static content is shorter than the minimum, consider padding it with useful reference material that benefits the model's responses.
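A rough way to check whether a static prefix clears the minimum is the common heuristic of about four characters per token for English prose (for exact counts, use a real tokenizer or Anthropic's token-counting endpoint). A sketch:

```python
def estimated_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose
    return len(text) // 4


def clears_cache_minimum(text: str, model: str = "sonnet") -> bool:
    # 2,048-token minimum for Haiku; 1,024 for Sonnet and Opus
    minimum = 2_048 if "haiku" in model.lower() else 1_024
    return estimated_tokens(text) >= minimum


prompt = "You are a customer support assistant..." * 200  # ~7,800 characters
print(clears_cache_minimum(prompt))           # True: ~1,950 est. tokens >= 1,024
print(clears_cache_minimum(prompt, "haiku"))  # False: ~1,950 est. tokens < 2,048
```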

Step 2: Structure your prompt for caching.
Reorganize your request so that static content comes first and dynamic content comes last. The cache works on prefixes: it caches from the beginning of the prompt up to the cache control marker, and any content after the marker is processed normally. If a static block such as your system prompt or tool definitions sits after dynamic content in the request, it cannot be cached, because the prefix differs on every call.
```python
# Optimal ordering for prompt caching. Note: the Anthropic Messages API
# takes the system prompt as a top-level `system` parameter, not as a
# message with role "system" inside the messages array.

# 1. System prompt (static, cacheable)
system = [
    {
        "type": "text",
        "text": "You are a customer support assistant...",
        "cache_control": {"type": "ephemeral"},
    }
]

messages = [
    # 2. Conversation history (dynamic, not cached)
    {"role": "user", "content": "Previous message..."},
    {"role": "assistant", "content": "Previous response..."},
    # 3. Current user message (dynamic, not cached)
    {"role": "user", "content": "Current question..."},
]
```
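Because matching is byte-for-byte, even trivial edits to the static section invalidate the cache. As a sketch, using a hash as a stand-in for the cache key (`prefix_key` is illustrative, not an SDK function), a single changed character yields a different key and therefore a cache miss:

```python
import hashlib
import json


def prefix_key(blocks) -> str:
    # Stand-in for the cache key: a hash over the serialized prefix
    return hashlib.sha256(json.dumps(blocks, sort_keys=True).encode()).hexdigest()


a = [{"type": "text", "text": "You are a customer support assistant."}]
b = [{"type": "text", "text": "You are a customer support assistant!"}]

print(prefix_key(a) == prefix_key(a))  # True: identical prefix, cache hit
print(prefix_key(a) == prefix_key(b))  # False: one character off, cache miss
```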
Step 3: Add cache control markers.
Add the cache_control parameter to the content blocks you want to cache. The marker tells the API where the cacheable prefix ends. You can place up to 4 cache breakpoints in a single request, which is useful for caching multiple static sections with dynamic content between them.
```python
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """Your detailed system prompt goes here.
Include all instructions, guardrails, and formatting requirements.
The longer this is, the more you save from caching.""",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=[
        {
            "name": "search_knowledge",
            "description": "Search the knowledge base",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
```
Step 4: Verify caching with usage headers.
The API response includes usage fields that show how many tokens were served from cache versus processed fresh. Check cache_creation_input_tokens (tokens processed and cached for the first time) and cache_read_input_tokens (tokens served from an existing cache). On the first call, you will see cache_creation tokens. On subsequent calls with the same prefix, you will see cache_read tokens.
```python
# Check cache performance
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation: {usage.cache_creation_input_tokens}")
print(f"Cache read: {usage.cache_read_input_tokens}")
print(f"Output tokens: {usage.output_tokens}")

# First call:        cache_creation = ~5000, cache_read = 0
# Subsequent calls:  cache_creation = 0,     cache_read = ~5000
```
Step 5: Calculate your cost savings.
Compare the cost with and without caching. For Claude Sonnet 4.6, input tokens cost $3 per million and cached tokens cost $0.30 per million. Cache creation has a small premium (25% above normal input cost) that is amortized across all subsequent cache hits.
```python
def calculate_savings(cached_tokens, calls_per_day, days=30):
    normal_cost_per_m = 3.00    # Claude Sonnet input, $/M tokens
    cached_cost_per_m = 0.30    # 90% discount on cache reads
    creation_cost_per_m = 3.75  # 25% premium on cache writes

    # Worst case: one cache creation per 5-minute TTL window
    creations_per_day = (24 * 60) / 5  # 288
    reads_per_day = max(calls_per_day - creations_per_day, 0)

    daily_without = (cached_tokens / 1_000_000) * normal_cost_per_m * calls_per_day
    daily_with = (
        (cached_tokens / 1_000_000) * creation_cost_per_m * creations_per_day
        + (cached_tokens / 1_000_000) * cached_cost_per_m * reads_per_day
    )
    return (daily_without - daily_with) * days


# Example: 5,000 cached tokens, 10,000 calls/day
savings = calculate_savings(5000, 10000)
# Approximately $3,900/month in savings under these assumptions
```
Step 6: Optimize cache hit rate.
Maximize your cache hit rate by ensuring the cached prefix is identical across requests. Even a single character difference in the cached section causes a cache miss. Avoid including timestamps, request IDs, or any variable content in the cached section. Monitor your cache_read vs cache_creation ratio. A healthy application should see cache reads on over 95% of requests.

Common Mistakes

The most common mistake is letting variable content, such as timestamps, request IDs, or per-user data, slip into the cached section; even a one-character difference forces a full cache miss. Other frequent mistakes are placing the system prompt or tool definitions after dynamic content, where they can never be part of a cacheable prefix; marking a prefix shorter than the model's minimum, which is silently not cached; and letting gaps in traffic exceed the 5-minute TTL, so each new burst of requests pays the cache-write premium again.

Combine prompt caching with Adaptive Recall's memory system to minimize both token costs and context window usage.
