How to Implement Sliding Window Conversations
Why Sliding Windows Are Necessary
Without a sliding window, every message ever sent accumulates in the conversation context. A 20-turn conversation might use 8,000 tokens of history; a 100-turn conversation might use 40,000. Eventually the history exceeds the model's context window and something breaks: the provider truncates silently, the API returns an error, or the system prompt gets crowded out by the volume of history.
A sliding window solves this by keeping token usage bounded regardless of conversation length. The window holds the last N messages (typically the last 6 to 10 turns), and everything older is compressed into a summary. As new messages arrive, the oldest messages in the window are summarized and merged into the running summary. The total token usage never exceeds the window size plus the summary size, which makes overflow impossible.
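The mechanics can be sketched in a few lines. This is a toy illustration, not the full implementation below: `summarize` here is a stand-in for whatever compression step you use (in practice, an LLM call), and the window is bounded by message count rather than tokens for simplicity.

```python
from collections import deque

def summarize(messages):
    # Stand-in for an LLM summarization call: crude concatenation.
    return " | ".join(m[:20] for m in messages)

class ToySlidingWindow:
    def __init__(self, max_messages=6):
        self.window = deque()
        self.max_messages = max_messages
        self.summary = ""

    def add(self, message):
        self.window.append(message)
        # When the window overflows, fold the oldest half into the summary.
        if len(self.window) > self.max_messages:
            old = [self.window.popleft() for _ in range(self.max_messages // 2)]
            self.summary = summarize([self.summary] + old if self.summary else old)

w = ToySlidingWindow(max_messages=4)
for i in range(10):
    w.add(f"message {i}")
assert len(w.window) <= 4  # the window stays bounded no matter how many turns
```

However many messages arrive, the context is bounded by the window size plus whatever budget the summarizer is held to.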
Step-by-Step Implementation
Start with your model's context limit and subtract the tokens consumed by other components: system prompt, tool definitions, retrieved context, and the response budget. The remainder is your conversation history budget. Divide that by the average tokens per message to get a rough message count for the window.
class SlidingWindowConfig:
    def __init__(self, model_limit=128000):
        self.model_limit = model_limit
        self.system_prompt_budget = 3000
        self.tool_budget = 2000
        self.retrieval_budget = 4000
        self.response_budget = 4000
        self.safety_margin = 1000
        self.history_budget = (
            self.model_limit
            - self.system_prompt_budget
            - self.tool_budget
            - self.retrieval_budget
            - self.response_budget
            - self.safety_margin
        )
        self.summary_budget = int(self.history_budget * 0.2)
        self.recent_budget = self.history_budget - self.summary_budget

For a 128k model with the budgets above, the history budget is 114,000 tokens, with 22,800 reserved for the running summary and 91,200 for recent messages. For a 16k model the numbers are much tighter: only 2,000 tokens of history remain, roughly 400 for the summary and 1,600 for recent messages, which forces more aggressive summarization and a shorter window.
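The arithmetic above can be checked directly; this helper just restates the same budgets as a standalone function:

```python
def history_split(model_limit, overhead=3000 + 2000 + 4000 + 4000 + 1000):
    # Overhead mirrors the budgets above: system prompt, tools,
    # retrieval, response, and safety margin.
    history = model_limit - overhead
    summary = int(history * 0.2)
    return history, summary, history - summary

print(history_split(128_000))  # (114000, 22800, 91200)
print(history_split(16_000))   # (2000, 400, 1600)
```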
Store the active window messages in a buffer with their token counts. The implementation below removes messages from the buffer as they are summarized; if you need a full audit trail or the ability to rebuild the summary later, persist summarized messages elsewhere (for example, a database) before they leave the window.
class MessageBuffer:
    def __init__(self, tokenizer):
        self.messages = []
        self.tokenizer = tokenizer
        self.summary = ""
        self.summary_tokens = 0

    def add(self, role, content):
        # The +4 approximates per-message formatting overhead
        # (role markers and separators).
        tokens = self.tokenizer.count(content) + 4
        self.messages.append({
            "role": role,
            "content": content,
            "tokens": tokens,
            "index": len(self.messages)
        })

    def recent_tokens(self, n):
        return sum(m["tokens"] for m in self.messages[-n:])

    def total_recent_tokens(self):
        # Everything that will enter the context: window messages plus summary.
        return sum(m["tokens"] for m in self.messages) + self.summary_tokens

Before every API call, check whether the total context exceeds the budget. If it does, summarize the oldest unsummarized messages until the total fits. The summarization should capture key decisions, user preferences, established facts, and unresolved questions. It should not include greetings, acknowledgments, or tangential content.
    def maybe_summarize(self, config):
        total = self.total_recent_tokens()
        if total <= config.history_budget:
            return
        messages_to_summarize = []
        tokens_to_free = total - config.history_budget
        freed = 0
        # Always keep at least the last 4 messages in the window.
        while freed < tokens_to_free and len(self.messages) > 4:
            msg = self.messages.pop(0)
            messages_to_summarize.append(msg)
            freed += msg["tokens"]
        if messages_to_summarize:
            new_context = self._generate_summary(messages_to_summarize)
            if self.summary:
                self.summary = self._merge_summaries(self.summary, new_context)
            else:
                self.summary = new_context
            self.summary_tokens = self.tokenizer.count(self.summary)

Insert the running summary as the first message after the system prompt. Frame it as context from earlier in the conversation so the model understands it is background information, not a new instruction. Use a clear label so the summary is distinguishable from actual messages.
    def build_context(self, system_prompt):
        messages = [{"role": "system", "content": system_prompt}]
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Context from earlier in this conversation:\n{self.summary}"
            })
        for msg in self.messages:
            messages.append({
                "role": msg["role"],
                "content": msg["content"]
            })
        return messages

Sometimes a recent message references something from earlier in the conversation ("use the same approach we discussed for the API"). If the referenced content has been summarized, the model needs that reference to be present in the summary. Detect references to earlier content by looking for phrases like "as we discussed," "the approach from earlier," or "the one we talked about" and ensure those topics appear in the summary.
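A minimal detector for such back-references might look like the following; the phrase list is illustrative, not exhaustive, and real conversations will need tuning:

```python
import re

# Phrases that usually point back at content that may have been summarized.
BACK_REFERENCE_PATTERNS = [
    r"\bwe discussed\b",
    r"\b(?:the|that) approach from earlier\b",
    r"\bthe one we talked about\b",
    r"\b(?:like|same as) before\b",
]

def references_earlier_content(message: str) -> bool:
    # True if the message appears to lean on earlier conversation content.
    return any(re.search(p, message, re.IGNORECASE)
               for p in BACK_REFERENCE_PATTERNS)
```

When a detector like this fires, you can re-run summarization with instructions to preserve the referenced topic, or pull the relevant archived messages back into the prompt.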
A practical approach is to prompt the summarizer with both the messages to summarize and the most recent messages. This lets the summarizer see what topics are currently active and prioritize those in the summary. Topics that are both historically important and currently referenced get full representation, while topics that were discussed once and never mentioned again get minimal coverage.
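One way to give the summarizer that visibility is to include the recent messages in its prompt as read-only context. The prompt wording below is a sketch, not a tested template:

```python
def build_summarizer_prompt(to_summarize, recent):
    # Render both message groups as plain role-prefixed transcripts.
    older = "\n".join(f'{m["role"]}: {m["content"]}' for m in to_summarize)
    active = "\n".join(f'{m["role"]}: {m["content"]}' for m in recent)
    return (
        "Summarize the OLDER MESSAGES below. Preserve key decisions, user "
        "preferences, established facts, and unresolved questions. Give extra "
        "weight to topics the RECENT MESSAGES still refer to; omit greetings "
        "and tangents.\n\n"
        f"OLDER MESSAGES:\n{older}\n\n"
        f"RECENT MESSAGES (context only, do not summarize):\n{active}"
    )
```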
Run automated tests with conversations of 50, 100, and 200 turns. At each checkpoint, ask the model to recall key decisions from early in the conversation, explain the current state of the discussion, and perform a task that requires context from multiple conversation phases. These tests catch summarization quality issues before they affect real users.
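The checkpoint schedule itself is easy to make explicit. Here the recall probes are example prompts, and driving them through your actual chat stack is left to a hypothetical test runner:

```python
CHECKPOINTS = [50, 100, 200]

RECALL_PROBES = [
    "What key decisions did we make near the start of this conversation?",
    "Summarize the current state of this discussion.",
    "Apply the approach we settled on earlier to this new case.",
]

def recall_probe_schedule(turns):
    # Pair every checkpoint that fits in the conversation with each probe.
    return [(t, probe)
            for t in CHECKPOINTS if t <= turns
            for probe in RECALL_PROBES]

# A 120-turn conversation is probed at turns 50 and 100, three probes each.
assert len(recall_probe_schedule(120)) == 6
```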
Sliding Windows vs External Memory
Sliding windows are a conversation-level solution. They manage the history of a single conversation but do not persist knowledge across conversations. When the conversation ends, the sliding window is discarded.
External memory persists across conversations. A fact learned in one conversation is available in every future conversation without being re-injected. For applications that need cross-session continuity, sliding windows handle the within-session problem while external memory handles the between-session problem. The most robust applications use both: a sliding window for the current conversation and external memory for persistent knowledge.
Combine sliding windows with Adaptive Recall for both within-session and cross-session memory. Conversations stay coherent no matter how long they run.
Try It Free