How to Summarize Long Conversations for Context
Before You Start
You need a chatbot that holds conversations long enough to approach or exceed context window limits, typically more than 15 to 20 turns. If your average conversation is 5 turns, summarization is unnecessary overhead. You also need metrics on your current context usage: average tokens per turn, average conversation length, system prompt size, and how much headroom you have before hitting the context limit. These numbers tell you when summarization should trigger and how aggressively it should compress. Finally, decide what "important" means for your application. A customer support chatbot prioritizes the problem description, the troubleshooting steps tried, and the resolution status. A personal assistant prioritizes decisions made, action items, and preferences expressed. Your summarization prompt must encode these priorities.
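As a first pass at those numbers, a sketch like the one below can be run over logged conversations. The characters-divided-by-four token estimate, the 128,000-token window, and the 1,500-token system prompt are illustrative assumptions; substitute your provider's tokenizer and your model's actual limits.

def context_usage_metrics(conversations, context_window=128_000, system_prompt_tokens=1_500):
    """Estimate per-turn token usage and headroom from logged conversations.

    `conversations` is a list of conversations, each a list of
    {"role": ..., "content": ...} messages. The len(text) // 4 heuristic is a
    rough assumption; use your provider's tokenizer for real numbers.
    """
    def est_tokens(text):
        return len(text) // 4

    turn_tokens = [est_tokens(m["content"]) for conv in conversations for m in conv]
    avg_tokens_per_turn = sum(turn_tokens) / max(len(turn_tokens), 1)
    avg_turns = sum(len(conv) for conv in conversations) / max(len(conversations), 1)
    avg_conversation_tokens = avg_tokens_per_turn * avg_turns
    headroom = context_window - system_prompt_tokens - avg_conversation_tokens

    return {
        "avg_tokens_per_turn": avg_tokens_per_turn,
        "avg_turns": avg_turns,
        "avg_conversation_tokens": avg_conversation_tokens,
        "headroom_tokens": headroom,
    }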
Step-by-Step Implementation
Three strategies exist, each optimized for different scenarios. Progressive summarization runs incrementally as the conversation grows, summarizing the oldest N turns and prepending the result to a running summary. This is the most common approach because it bounds context size without requiring a large one-time summarization of the full history. Periodic full summarization waits until the context approaches a threshold (say, 80 percent of the window), then summarizes the entire conversation into a compact form and replaces the history with the summary plus the most recent messages. This approach is simpler to implement but introduces latency spikes when summarization triggers. Extractive summarization pulls discrete facts from the conversation into separate memory entries rather than producing a narrative summary. This approach works best when the goal is long-term memory rather than within-conversation context management, because extracted facts are individually retrievable by future queries. Most production systems use progressive summarization for within-session context management and extractive summarization for long-term memory storage, running both in parallel.
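For comparison with the progressive approach implemented below, here is a minimal sketch of the periodic variant: nothing happens until estimated usage crosses a threshold, then everything except the most recent turns is collapsed into a single summary. The count_tokens, summarize_with_llm, and format_messages helpers, the 0.8 threshold, and the inline prompt are assumptions for illustration, not a fixed recipe.

async def periodic_full_summarize(messages, context_window=128_000,
                                  threshold=0.8, keep_recent=8):
    # count_tokens, summarize_with_llm, and format_messages are assumed
    # helpers: token counting, the summarization model call, and rendering
    # messages as plain text.
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < threshold * context_window:
        return messages  # plenty of headroom, leave the history alone

    # Collapse everything except the most recent turns into one summary.
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = await summarize_with_llm(
        "Summarize this conversation, preserving facts, decisions, and "
        "unresolved questions:\n" + format_messages(older)
    )
    return [{"role": "system", "content": "Conversation so far: " + summary}, *recent]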
Progressive summarization processes the conversation in chunks. After every N turns (typically 5 to 8), take the oldest unsummarized turns, combine them with the existing running summary, and produce an updated running summary. The prompt should instruct the model to: preserve all factual information (names, numbers, dates, decisions), maintain the chronological arc of the conversation (what was discussed in what order), note any unresolved questions or pending items, and discard greetings, acknowledgments, filler phrases, and repetitive exchanges. The output should be 200 to 500 tokens regardless of input length, with the model compressing more aggressively as the conversation grows. Use a smaller, cheaper model for summarization (Claude Haiku or GPT-4o-mini) since the task is compression rather than creative generation.
class ProgressiveSummarizer:
    def __init__(self, summarize_every=6, max_summary_tokens=500):
        self.summarize_every = summarize_every
        self.max_summary_tokens = max_summary_tokens
        self.running_summary = ""
        self.unsummarized_start = 0

    async def maybe_summarize(self, messages):
        # Only summarize once enough unsummarized turns have accumulated.
        unsummarized = messages[self.unsummarized_start:]
        if len(unsummarized) < self.summarize_every:
            return

        chunk_to_summarize = unsummarized[:self.summarize_every]
        prompt = f"""Update this running conversation summary with the new messages.
Current summary:
{self.running_summary or '(conversation just started)'}
New messages to incorporate:
{format_messages(chunk_to_summarize)}
Rules:
- Preserve all facts, decisions, names, numbers, dates
- Note unresolved questions
- Discard greetings, filler, repetitive exchanges
- Keep under {self.max_summary_tokens} tokens
- Write in third person past tense"""

        # format_messages and summarize_with_llm are helpers defined elsewhere:
        # one renders messages as plain text, the other calls the (cheaper)
        # summarization model.
        self.running_summary = await summarize_with_llm(prompt)
        self.unsummarized_start += self.summarize_every

    def build_context(self, messages, recent_count=8):
        # The context sent to the model is the running summary plus the most
        # recent messages in full.
        recent = messages[-recent_count:]
        return {
            "summary": self.running_summary,
            "recent_messages": recent
        }

Separately from progressive summarization (which manages the context window), run an extractive pass that pulls discrete facts from the conversation for storage in persistent memory. Extraction differs from summarization in three ways: it produces multiple independent outputs (one per fact) rather than a single narrative, it strips all conversational context to produce standalone statements, and it categorizes each fact for structured retrieval. Run extraction at conversation end or periodically during long conversations. Each extracted fact is stored as an individual memory with its own embedding, entities, and metadata, making it independently searchable in future sessions. This is where the conversation's ephemeral content becomes durable knowledge.
import json

async def extract_memories_from_conversation(messages, user_id):
    prompt = """Extract discrete facts from this conversation that would be
useful to remember in future interactions with this user.
For each fact, provide:
- content: a clear, standalone sentence
- category: preference / decision / project / personal / technical
- importance: high / medium / low
Only extract information that has lasting value. Skip:
- Greetings and social pleasantries
- Temporary states ("I'm frustrated right now")
- Information that changes rapidly ("the server is down")
- Anything already known from earlier in the conversation
Return a JSON array of extracted facts."""

    # llm_extract, format_messages, and memory_service are helpers defined
    # elsewhere; the extraction call returns the model's JSON output as a string.
    result = await llm_extract(prompt, format_messages(messages))
    facts = json.loads(result)

    for fact in facts:
        # Low-importance facts are dropped rather than stored.
        if fact["importance"] != "low":
            await memory_service.store(
                content=fact["content"],
                metadata={
                    "user_id": user_id,
                    "category": fact["category"],
                    "source": "conversation_extraction"
                }
            )

Combine the running summary, recent messages, and recalled memories into an optimized context that gives the model the best possible information within the token budget. The structure should be: system prompt (fixed, typically 500 to 2,000 tokens), recalled memories (dynamic, 200 to 800 tokens, retrieved based on the current message), running conversation summary (dynamic, 200 to 500 tokens, covering older parts of the conversation), and recent messages (dynamic, 1,500 to 4,000 tokens, the last 5 to 10 turns in full). This ordering matters: the system prompt establishes behavior, recalled memories provide cross-session context, the summary provides within-session history, and recent messages provide the immediate conversational thread. The model processes them in this order and gives appropriate weight to each layer.
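A sketch of that assembly step, built on the ProgressiveSummarizer above; the recall_memories helper, the SYSTEM_PROMPT constant, and the limit of five memories are illustrative assumptions rather than fixed parts of the pipeline.

async def build_optimized_context(summarizer, messages, current_message, user_id):
    # recall_memories is an assumed helper that retrieves stored facts
    # relevant to the current message (see the extraction step above).
    memories = await recall_memories(user_id, query=current_message, limit=5)
    parts = summarizer.build_context(messages, recent_count=8)

    context = [
        # 1. Fixed system prompt: establishes behavior.
        {"role": "system", "content": SYSTEM_PROMPT},
        # 2. Recalled memories: cross-session context for this user.
        {"role": "system", "content": "Relevant memories:\n" +
            "\n".join(m["content"] for m in memories)},
        # 3. Running summary: within-session history from older turns.
        {"role": "system", "content": "Earlier in this conversation:\n" +
            (parts["summary"] or "(nothing summarized yet)")},
    ]
    # 4. Recent messages in full: the immediate conversational thread.
    context.extend(parts["recent_messages"])
    return context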
Summarization quality is notoriously hard to measure automatically because it requires understanding what information is "important" in context. Three practical approaches: factual preservation testing creates conversations with known facts embedded in them, runs summarization, and checks whether all facts are preserved in the output. A preservation rate below 95 percent for critical facts (names, numbers, decisions) indicates a prompt quality issue. Downstream quality testing measures whether the chatbot's responses are better or worse with summarized context compared to full history. Use a set of test conversations that exceed the context window and compare response quality (human rated) with truncation (dropping oldest messages) versus summarization. Summarization should produce equal or better responses than truncation for a small fraction of the cost. Information loss auditing periodically reviews summarized conversations and flags cases where important context was lost, feeding these failures back into the summarization prompt as negative examples.
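A minimal sketch of the first approach, factual preservation testing, built on the ProgressiveSummarizer above. Plant the known facts in the older turns that will be folded into the summary; the plain substring check is an illustrative assumption, and an LLM grader handles paraphrased facts more robustly.

async def test_factual_preservation(conversation, known_facts):
    """Run progressive summarization over a scripted conversation and check
    that planted facts (names, numbers, decisions) survive in the summary."""
    summarizer = ProgressiveSummarizer()

    # Fold the scripted conversation into the running summary chunk by chunk.
    while len(conversation) - summarizer.unsummarized_start >= summarizer.summarize_every:
        await summarizer.maybe_summarize(conversation)

    summary = summarizer.running_summary.lower()
    missing = [fact for fact in known_facts if fact.lower() not in summary]
    preservation_rate = 1 - len(missing) / len(known_facts)
    return preservation_rate, missing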
Summarization costs money (an LLM call per summarization cycle) and adds latency (the summarization call takes 200 to 800 ms). Optimize by: using the cheapest model that maintains quality (Claude Haiku or GPT-4o-mini for narrative summarization, since this is a compression task that does not require the strongest reasoning), running summarization asynchronously between turns rather than blocking the response, increasing the summarization interval (every 8 turns instead of every 5) if quality metrics show minimal degradation, and caching summaries so the same conversation segment is never re-summarized. The total cost of summarization should be less than the cost savings from reduced context: sending 500 tokens of summary instead of 5,000 tokens of raw history saves roughly 4,500 input tokens per turn. At $3 per million input tokens, that saves $0.0135 per turn, which easily pays for the periodic summarization call.
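One way to keep the summarization call off the response path is to schedule it as a background task after the reply has been generated. This sketch assumes asyncio and a hypothetical generate_reply helper; it is a pattern illustration, not a prescribed integration.

import asyncio

background_tasks = set()

async def handle_turn(summarizer, messages, user_message):
    messages.append({"role": "user", "content": user_message})

    # Respond first, using whatever summary already exists, so the user
    # never waits on the summarization call.
    context = summarizer.build_context(messages)
    reply = await generate_reply(context)  # generate_reply is an assumed helper
    messages.append({"role": "assistant", "content": reply})

    # Run summarization between turns in the background. Keep a reference
    # to the task so it is not garbage collected before it finishes.
    task = asyncio.create_task(summarizer.maybe_summarize(messages))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)

    return reply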
Summarization and Long-Term Memory
The most powerful architecture runs both progressive summarization and extractive memory in parallel. Progressive summarization manages the immediate conversation by keeping the context window bounded. Extractive memory captures durable knowledge for future sessions. Together, they ensure that a 50-turn conversation uses the same context budget as a 10-turn conversation (through summarization) while also producing 15 to 20 discrete memories that will be available in every future session with this user (through extraction).
Adaptive Recall's consolidation process performs a similar function at the memory level: periodically reviewing stored memories, merging redundant ones, updating confidence scores based on corroboration, and producing synthesized knowledge from patterns across multiple conversations. This means the summarization pipeline does not need to be perfect. Even if extraction misses a fact or stores it redundantly, the consolidation process will clean it up. The combination of conversation-level extraction and memory-level consolidation produces a system that accumulates clean, accurate, deduplicated knowledge over time with minimal manual intervention.
Let conversations become knowledge. Adaptive Recall extracts, stores, and consolidates facts from every conversation automatically, building a memory that improves with every interaction.
Get Started Free