How to Add Conversation History to an AI Assistant
Before You Start
You need a working assistant that handles at least single-turn conversations. This guide adds multi-turn awareness and cross-session memory to that foundation. You also need a persistent storage backend (Redis, PostgreSQL, DynamoDB, or equivalent) for storing message history. In-memory storage works for prototyping but loses all history when the server restarts, which is unacceptable for production.
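If you want to prototype before choosing a backend, a throwaway in-memory store is enough to develop against. The following is a minimal sketch (the class name is illustrative, and everything here vanishes when the process exits):

# Sketch: in-memory store for prototyping only. All history is lost
# on restart; swap in a persistent backend before production.
import time
from collections import defaultdict

class InMemoryConversationStore:
    def __init__(self):
        self._conversations = defaultdict(list)

    def add_message(self, conversation_id, role, content, metadata=None):
        self._conversations[conversation_id].append({
            "role": role,
            "content": content,
            "timestamp": time.time(),
            "metadata": metadata or {},
        })

    def get_history(self, conversation_id, limit=None):
        history = self._conversations[conversation_id]
        return history[-limit:] if limit else history

Keeping the same add/get interface as the persistent version below means you can swap backends without touching the rest of the assistant.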
Step-by-Step Setup
Every message in every conversation needs to be persisted so that history is available when the user sends their next message, even after a page refresh, server restart, or load balancer routing change. Store each message with: the conversation ID (which conversation the message belongs to), the role (user, assistant, system, tool), the content, a timestamp, and any metadata (tool call IDs, token counts, or model information).
# Example: message storage with Redis
import json
import time

class ConversationStore:
    def __init__(self, redis_client):
        self.redis = redis_client

    def add_message(self, conversation_id, role, content, metadata=None):
        message = {
            "role": role,
            "content": content,
            "timestamp": time.time(),
            "metadata": metadata or {}
        }
        self.redis.rpush(
            f"conv:{conversation_id}",
            json.dumps(message)
        )
        # Set TTL for automatic cleanup (30 days)
        self.redis.expire(f"conv:{conversation_id}", 86400 * 30)

    def get_history(self, conversation_id, limit=None):
        messages = self.redis.lrange(f"conv:{conversation_id}", 0, -1)
        history = [json.loads(m) for m in messages]
        if limit:
            history = history[-limit:]
        return history

Language models have fixed context windows. When a conversation generates more tokens than the window can hold, you need a strategy for deciding what stays and what goes. The simplest approach is truncation: keep the most recent N messages and drop older ones. This is easy to implement but loses information unpredictably, since important context from early in the conversation disappears while irrelevant recent messages stay.
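For reference, truncation amounts to a one-liner. This is a minimal sketch (the function name and cutoff are illustrative, not part of any API):

# Sketch: naive truncation. Keeps only the last N messages; everything
# older is dropped, including early context the model might still need.
def truncate_history(messages, max_messages=20):
    return messages[-max_messages:]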
A better approach is selective retention. Keep the system prompt (always), the most recent messages (usually the last 5 to 10 turns), and a summary of everything in between. This preserves the assistant's instructions, the current conversational context, and the key facts from earlier exchanges while staying within the token budget.
import tiktoken

def fit_to_context(messages, system_prompt, max_tokens=100000):
    encoder = tiktoken.encoding_for_model("gpt-4o")

    # Reserve tokens for system prompt and response
    system_tokens = len(encoder.encode(system_prompt))
    response_reserve = 4096
    available = max_tokens - system_tokens - response_reserve

    # Always keep the most recent messages
    recent = messages[-10:]
    recent_tokens = sum(len(encoder.encode(m["content"])) for m in recent)

    if recent_tokens >= available:
        # Even recent messages exceed budget, keep fewer
        fitted = []
        used = 0
        for msg in reversed(recent):
            msg_tokens = len(encoder.encode(msg["content"]))
            if used + msg_tokens > available:
                break
            fitted.insert(0, msg)
            used += msg_tokens
        return fitted

    # Fit older messages into remaining space
    remaining = available - recent_tokens
    older = messages[:-10]
    fitted_older = []
    used = 0
    for msg in reversed(older):
        msg_tokens = len(encoder.encode(msg["content"]))
        if used + msg_tokens > remaining:
            break
        fitted_older.insert(0, msg)
        used += msg_tokens
    return fitted_older + recent

When older messages are dropped from the context, their information is lost unless you summarize them first. Summarization condenses the dropped messages into a compact paragraph that preserves key facts, decisions, and open items while dramatically reducing token count. The summary replaces the original messages in the context, maintaining continuity without the token cost.
The summarization itself can be done by the same language model, using a dedicated summarization prompt that instructs it to extract key facts, decisions, user preferences, and open questions while dropping pleasantries, redundant exchanges, and thinking-out-loud passages. Run summarization when the conversation crosses a token threshold (for example, when older messages would need to be dropped to fit the context window).
async def summarize_older_messages(messages, model_client):
    if len(messages) <= 10:
        return messages  # No summarization needed

    older = messages[:-10]
    recent = messages[-10:]

    summary_prompt = """Summarize this conversation excerpt into a concise paragraph.
Preserve: key facts stated, decisions made, user preferences expressed,
open questions or tasks. Drop: greetings, redundant exchanges, filler.
Write in third person: 'The user said...' 'The assistant found...'"""

    older_text = "\n".join([f"{m['role']}: {m['content']}" for m in older])

    summary = await model_client.generate(
        system=summary_prompt,
        messages=[{"role": "user", "content": older_text}]
    )

    summary_message = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summary.text}"
    }
    return [summary_message] + recent

Conversation history handles continuity within a session. Cross-session continuity requires extracting the important knowledge from a conversation and storing it in persistent memory that can be retrieved in future sessions. At the end of each conversation (or continuously during it), extract facts, preferences, and decisions and store them as discrete memories. At the start of each new conversation, retrieve relevant memories and include them in the context.
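The session-start half of that loop might look like the sketch below. The `search` call is an assumed interface, not a documented API (the extraction example that follows only shows `store`), so adapt it to whatever retrieval your memory layer actually exposes:

# Sketch: seed a new session with memories relevant to the first message.
# memory_client.search() is an assumed interface, not a documented API.
async def start_session_context(first_user_message, memory_client, base_system_prompt):
    memories = await memory_client.search(query=first_user_message, limit=5)
    if not memories:
        return base_system_prompt
    memory_block = "\n".join(f"- {m}" for m in memories)
    return (
        f"{base_system_prompt}\n\n"
        f"Relevant context from previous sessions:\n{memory_block}"
    )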
The distinction between conversation history and persistent memory is critical. History is a transcript of what was said. Memory is extracted knowledge: the important facts, preferences, and decisions distilled from many conversations. An assistant that only uses history forgets everything between sessions. An assistant that extracts and stores memory builds a growing understanding of the user and their context that makes every subsequent conversation more efficient and personalized.
async def end_session_extraction(conversation_id, store, memory_client, model_client):
    history = store.get_history(conversation_id)

    extraction_prompt = """Review this conversation and extract any information
worth remembering for future sessions. Include:
- Facts the user stated about their project, preferences, or situation
- Decisions made during the conversation
- Corrections to previously known information
Return a JSON array of strings, each a discrete fact to remember.
Return an empty array if nothing is worth storing long-term."""

    full_text = "\n".join([f"{m['role']}: {m['content']}" for m in history])

    result = await model_client.generate(
        system=extraction_prompt,
        messages=[{"role": "user", "content": full_text}]
    )

    # Assumes the model returned valid JSON; validate before trusting in production
    memories = json.loads(result.text)
    for memory in memories:
        await memory_client.store(content=memory)

History vs Memory: Understanding the Difference
A common architectural mistake is treating conversation history as the memory system. History is a sequential log of messages. Memory is organized, searchable, and curated knowledge. History grows linearly with every message and becomes unwieldy quickly. Memory is compact because related information is consolidated and irrelevant information is discarded. History is tied to a specific conversation. Memory spans all conversations and provides cross-session intelligence.
The best architecture uses both: conversation history for within-session context (what was just said, what the current task is, what tools have been called) and persistent memory for cross-session knowledge (who the user is, what their project does, what decisions they have made). Adaptive Recall provides the memory layer while your application manages the history layer, and the two work together to give the assistant both immediate awareness and long-term understanding.
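To make the two layers concrete, here is a hedged sketch of a single request cycle. It reuses ConversationStore and fit_to_context from earlier, and the model client interface is the same assumed one as in the previous examples:

# Sketch: one turn using both layers. Assumes system_prompt was built
# at session start with retrieved memories (see start_session_context),
# and reuses ConversationStore and fit_to_context from earlier examples.
async def handle_turn(conversation_id, user_message, store, model_client, system_prompt):
    store.add_message(conversation_id, "user", user_message)
    history = store.get_history(conversation_id)
    fitted = fit_to_context(history, system_prompt)
    # Strip storage-only fields (timestamp, metadata) before calling the model
    api_messages = [{"role": m["role"], "content": m["content"]} for m in fitted]
    reply = await model_client.generate(system=system_prompt, messages=api_messages)
    store.add_message(conversation_id, "assistant", reply.text)
    return reply.text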
Conversation history handles the current session. Adaptive Recall handles everything else, providing persistent memory that turns past conversations into future context.