How to Add Conversation History to an AI Assistant
Before You Start
You need a working assistant that handles at least single-turn conversations. This guide adds multi-turn awareness and cross-session memory to that foundation. You also need a persistent storage backend (Redis, PostgreSQL, DynamoDB, or equivalent) for storing message history. In-memory storage works for prototyping but loses all history when the server restarts, which is unacceptable for production.
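If you want to prototype before choosing a backend, a throwaway in-memory store is enough to develop against. The following is a minimal sketch (the class name is illustrative, and everything here vanishes when the process exits):

# Sketch: in-memory store for prototyping only. All history is lost
# on restart; swap in a persistent backend before production.
import time
from collections import defaultdict

class InMemoryConversationStore:
    def __init__(self):
        self._conversations = defaultdict(list)

    def add_message(self, conversation_id, role, content, metadata=None):
        self._conversations[conversation_id].append({
            "role": role,
            "content": content,
            "timestamp": time.time(),
            "metadata": metadata or {},
        })

    def get_history(self, conversation_id, limit=None):
        history = self._conversations[conversation_id]
        return history[-limit:] if limit else history

Keeping the same add/get interface as the persistent version below means you can swap backends without touching the rest of the assistant.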
Step-by-Step Setup
Every message in every conversation needs to be persisted so that history is available when the user sends their next message, even after a page refresh, server restart, or load balancer routing change. Store each message with: the conversation ID (which conversation the message belongs to), the role (user, assistant, system, tool), the content, a timestamp, and any metadata (tool call IDs, token counts, or model information).
# Example: message storage with Redis
import json
import time

class ConversationStore:
    def __init__(self, redis_client):
        self.redis = redis_client

    def add_message(self, conversation_id, role, content, metadata=None):
        message = {
            "role": role,
            "content": content,
            "timestamp": time.time(),
            "metadata": metadata or {}
        }
        self.redis.rpush(
            f"conv:{conversation_id}",
            json.dumps(message)
        )
        # Set TTL for automatic cleanup (30 days)
        self.redis.expire(f"conv:{conversation_id}", 86400 * 30)

    def get_history(self, conversation_id, limit=None):
        messages = self.redis.lrange(f"conv:{conversation_id}", 0, -1)
        history = [json.loads(m) for m in messages]
        if limit:
            history = history[-limit:]
        return history

Language models have fixed context windows. When a conversation generates more tokens than the window can hold, you need a strategy for deciding what stays and what goes. The simplest approach is truncation: keep the most recent N messages and drop older ones. This is easy to implement but loses information unpredictably, since important context from early in the conversation disappears while irrelevant recent messages stay.
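For reference, truncation amounts to a one-liner. This is a minimal sketch (the function name and cutoff are illustrative, not part of any API):

# Sketch: naive truncation. Keeps only the last N messages; everything
# older is dropped, including early context the model might still need.
def truncate_history(messages, max_messages=20):
    return messages[-max_messages:]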
A better approach is selective retention. Keep the system prompt (always), the most recent messages (usually the last 5 to 10 turns), and a summary of everything in between. This preserves the assistant's instructions, the current conversational context, and the key facts from earlier exchanges while staying within the token budget.
import tiktoken

def fit_to_context(messages, system_prompt, max_tokens=100000):
    encoder = tiktoken.encoding_for_model("gpt-4o")

    # Reserve tokens for system prompt and response
    system_tokens = len(encoder.encode(system_prompt))
    response_reserve = 4096
    available = max_tokens - system_tokens - response_reserve

    # Always keep the most recent messages
    recent = messages[-10:]
    recent_tokens = sum(len(encoder.encode(m["content"])) for m in recent)

    if recent_tokens >= available:
        # Even recent messages exceed budget, keep fewer
        fitted = []
        used = 0
        for msg in reversed(recent):
            msg_tokens = len(encoder.encode(msg["content"]))
            if used + msg_tokens > available:
                break
            fitted.insert(0, msg)
            used += msg_tokens
        return fitted

    # Fit older messages into remaining space
    remaining = available - recent_tokens
    older = messages[:-10]
    fitted_older = []
    used = 0
    for msg in reversed(older):
        msg_tokens = len(encoder.encode(msg["content"]))
        if used + msg_tokens > remaining:
            break
        fitted_older.insert(0, msg)
        used += msg_tokens
    return fitted_older + recent

When older messages are dropped from the context, their information is lost unless you summarize them first. Summarization condenses the dropped messages into a compact paragraph that preserves key facts, decisions, and open items while dramatically reducing token count. The summary replaces the original messages in the context, maintaining continuity without the token cost.
The summarization itself can be done by the same language model, using a dedicated summarization prompt that instructs it to extract key facts, decisions, user preferences, and open questions while dropping pleasantries, redundant exchanges, and thinking-out-loud passages. Run summarization when the conversation crosses a token threshold (for example, when older messages would need to be dropped to fit the context window).
async def summarize_older_messages(messages, model_client):
    if len(messages) <= 10:
        return messages  # No summarization needed

    older = messages[:-10]
    recent = messages[-10:]

    summary_prompt = """Summarize this conversation excerpt into a concise paragraph.
Preserve: key facts stated, decisions made, user preferences expressed,
open questions or tasks. Drop: greetings, redundant exchanges, filler.
Write in third person: 'The user said...' 'The assistant found...'"""

    older_text = "\n".join([f"{m['role']}: {m['content']}" for m in older])

    summary = await model_client.generate(
        system=summary_prompt,
        messages=[{"role": "user", "content": older_text}]
    )

    summary_message = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summary.text}"
    }
    return [summary_message] + recent

Conversation history handles continuity within a session. Cross-session continuity requires extracting the important knowledge from a conversation and storing it in persistent memory that can be retrieved in future sessions. At the end of each conversation (or continuously during it), extract facts, preferences, and decisions and store them as discrete memories. At the start of each new conversation, retrieve relevant memories and include them in the context.
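The session-start half of that loop might look like the sketch below. The `search` call is an assumed interface, not a documented API (the extraction example that follows only shows `store`), so adapt it to whatever retrieval your memory layer actually exposes:

# Sketch: seed a new session with memories relevant to the first message.
# memory_client.search() is an assumed interface, not a documented API.
async def start_session_context(first_user_message, memory_client, base_system_prompt):
    memories = await memory_client.search(query=first_user_message, limit=5)
    if not memories:
        return base_system_prompt
    memory_block = "\n".join(f"- {m}" for m in memories)
    return (
        f"{base_system_prompt}\n\n"
        f"Relevant context from previous sessions:\n{memory_block}"
    )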
The distinction between conversation history and persistent memory is critical. History is a transcript of what was said. Memory is extracted knowledge: the important facts, preferences, and decisions distilled from many conversations. An assistant that only uses history forgets everything between sessions. An assistant that extracts and stores memory builds a growing understanding of the user and their context that makes every subsequent conversation more efficient and personalized.
async def end_session_extraction(conversation_id, store, memory_client, model_client):
    history = store.get_history(conversation_id)

    extraction_prompt = """Review this conversation and extract any information
worth remembering for future sessions. Include:
- Facts the user stated about their project, preferences, or situation
- Decisions made during the conversation
- Corrections to previously known information
Return a JSON array of strings, each a discrete fact to remember.
Return an empty array if nothing is worth storing long-term."""

    full_text = "\n".join([f"{m['role']}: {m['content']}" for m in history])

    result = await model_client.generate(
        system=extraction_prompt,
        messages=[{"role": "user", "content": full_text}]
    )

    # Assumes the model returned valid JSON; validate before trusting in production
    memories = json.loads(result.text)
    for memory in memories:
        await memory_client.store(content=memory)

History vs Memory: Understanding the Difference
A common architectural mistake is treating conversation history as the memory system. History is a sequential log of messages. Memory is organized, searchable, and curated knowledge. History grows linearly with every message and becomes unwieldy quickly. Memory is compact because related information is consolidated and irrelevant information is discarded. History is tied to a specific conversation. Memory spans all conversations and provides cross-session intelligence.
The best architecture uses both: conversation history for within-session context (what was just said, what the current task is, what tools have been called) and persistent memory for cross-session knowledge (who the user is, what their project does, what decisions they have made). Adaptive Recall provides the memory layer while your application manages the history layer, and the two work together to give the assistant both immediate awareness and long-term understanding.
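To make the two layers concrete, here is a hedged sketch of a single request cycle. It reuses ConversationStore and fit_to_context from earlier, and the model client interface is the same assumed one as in the previous examples:

# Sketch: one turn using both layers. Assumes system_prompt was built
# at session start with retrieved memories (see start_session_context),
# and reuses ConversationStore and fit_to_context from earlier examples.
async def handle_turn(conversation_id, user_message, store, model_client, system_prompt):
    store.add_message(conversation_id, "user", user_message)
    history = store.get_history(conversation_id)
    fitted = fit_to_context(history, system_prompt)
    # Strip storage-only fields (timestamp, metadata) before calling the model
    api_messages = [{"role": m["role"], "content": m["content"]} for m in fitted]
    reply = await model_client.generate(system=system_prompt, messages=api_messages)
    store.add_message(conversation_id, "assistant", reply.text)
    return reply.text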
Conversation history handles the current session. Adaptive Recall handles everything else, providing persistent memory that turns past conversations into future context.