How to Add Persistent Memory to Any LLM App
Before You Start
You need an existing application that calls an LLM API (OpenAI, Anthropic, or any provider). You also need a storage backend for memories. This guide uses a vector database for storage, but you can substitute any searchable store. If you want to skip building the infrastructure yourself, Adaptive Recall provides all three layers (extraction, storage, and retrieval) as a managed service through MCP or a REST API.
The approach here is provider-agnostic. It works with any LLM that accepts a system message, which is every major model available in 2026. The memory layer sits between your application and the LLM, intercepting conversations to extract memories and enriching prompts with retrieved context.
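In code, that interception is just a wrapper around each conversation turn: look up relevant memories, call the model with them in context, then store anything new the turn produced. Here is a minimal sketch of that loop, assuming a hypothetical handle_turn wrapper and the chat_with_memory and save_memories helpers built step by step in the next section:

def handle_turn(user_message, user_id, base_prompt):
    # Retrieval + generation: chat_with_memory (defined below) pulls relevant
    # memories and folds them into the system message before calling the LLM.
    reply = chat_with_memory(user_message, user_id, base_prompt)

    # Extraction + storage: save anything from this turn worth remembering.
    save_memories(f"User: {user_message}\nAssistant: {reply}", user_id)

    return reply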
Step-by-Step Implementation
Your memories need a home. Vector databases are the standard choice because they support semantic search, letting you find memories by meaning rather than exact text match. Popular options include pgvector (if you already run PostgreSQL), Pinecone (managed service), Qdrant (open source), and Weaviate (open source with cloud option).
For this guide, we use a simple in-process vector store to keep the focus on the memory logic. In production, replace this with your chosen database. The interface is the same: embed, store, search.
import time

import numpy as np
from openai import OpenAI

client = OpenAI()

class MemoryStore:
    def __init__(self):
        self.memories = []
        self.vectors = []

    def embed(self, text):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def store(self, text, metadata=None):
        vector = self.embed(text)
        self.memories.append({
            "text": text,
            "metadata": metadata or {},
            "created_at": time.time()
        })
        self.vectors.append(vector)

    def search(self, query, top_k=5):
        query_vec = self.embed(query)
        # Cosine similarity between the query and every stored memory
        scores = [
            np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
            for v in self.vectors
        ]
        ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        return [self.memories[i] for i, _ in ranked[:top_k]]

After each conversation turn, extract noteworthy information to store as memories. The most effective approach uses the LLM itself for extraction. Send the conversation to the model with instructions to identify facts, preferences, decisions, and observations worth remembering.
def extract_memories(conversation_text, user_id):
    extraction_prompt = """Review this conversation and extract information
worth remembering for future interactions. Focus on:
- User preferences and stated requirements
- Factual information about their project or situation
- Decisions that were made and their reasoning
- Technical details that would be useful later
Return each memory as a separate line. Return NONE if nothing
is worth storing."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": extraction_prompt},
            {"role": "user", "content": conversation_text}
        ],
        temperature=0.2
    )
    memories_text = response.choices[0].message.content
    if memories_text.strip() == "NONE":
        return []
    return [
        line.strip()
        for line in memories_text.split("\n")
        if line.strip()
    ]

Each extracted memory gets embedded as a vector and stored with metadata including the user ID, timestamp, and source context. The metadata enables filtering during retrieval (only show memories for this user, prefer recent memories).
import time

memory_store = MemoryStore()

def save_memories(conversation_text, user_id):
    extracted = extract_memories(conversation_text, user_id)
    for memory_text in extracted:
        memory_store.store(
            text=memory_text,
            metadata={
                "user_id": user_id,
                "source": "conversation",
                "timestamp": time.time()
            }
        )

At the start of each session or before each model call, retrieve memories relevant to the current context. The query can be the user's latest message, a summary of the conversation so far, or a combination of both.
def retrieve_context(query, user_id, max_memories=5):
    results = memory_store.search(query, top_k=max_memories)
    # Filter to only this user's memories
    user_memories = [
        m for m in results
        if m["metadata"].get("user_id") == user_id
    ]
    return user_memories

Format the retrieved memories and add them to the system message. The model reads this context and incorporates it into its responses naturally. Include timestamps so the model can assess recency.
def build_system_message(base_prompt, memories):
    if not memories:
        return base_prompt
    memory_block = "\n\nRelevant context from previous interactions:\n"
    for m in memories:
        age = time.time() - m["metadata"]["timestamp"]
        days_ago = int(age / 86400)
        time_label = f"{days_ago} days ago" if days_ago > 0 else "today"
        memory_block += f"- ({time_label}) {m['text']}\n"
    return base_prompt + memory_block

def chat_with_memory(user_message, user_id, base_prompt):
    # Retrieve relevant memories
    memories = retrieve_context(user_message, user_id)
    system_msg = build_system_message(base_prompt, memories)
    # Call the LLM with memory-enriched context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_message}
        ]
    )
    return response.choices[0].message.content

Verify that the full loop works: store some memories in one session, then start a new session and confirm the model uses the stored context. Test with specific facts that the model would not know without memory access.
# Session 1: Store information
save_memories(
    "User said they are building a React app with PostgreSQL backend "
    "and prefer TypeScript. They deploy to AWS ECS.",
    user_id="user_123"
)

# Session 2: New conversation, memory should be used
response = chat_with_memory(
    "What database should I use for this new feature?",
    user_id="user_123",
    base_prompt="You are a helpful coding assistant."
)
# Model should reference PostgreSQL from stored memory

Going Beyond Basic Memory
The implementation above handles the core memory loop, but production systems need additional capabilities. Deduplication prevents storing the same fact multiple times. Contradiction detection identifies when new information conflicts with stored memories. Consolidation merges related memories into more compact, useful summaries. Confidence scoring tracks which memories have been validated by repeated use versus those based on a single mention.
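As one illustration, deduplication can reuse the embeddings you already compute: before writing a new memory, check whether an existing one is nearly identical and skip the write if so. Here is a minimal sketch against the MemoryStore above; the is_duplicate helper and the 0.9 similarity threshold are illustrative choices, not part of the implementation shown earlier:

import time

import numpy as np

def is_duplicate(store, candidate_text, threshold=0.9):
    """Return True if an existing memory is nearly identical to candidate_text."""
    if not store.vectors:
        return False
    candidate_vec = store.embed(candidate_text)
    for v in store.vectors:
        similarity = np.dot(candidate_vec, v) / (
            np.linalg.norm(candidate_vec) * np.linalg.norm(v)
        )
        if similarity >= threshold:
            return True
    return False

def save_memories_deduplicated(conversation_text, user_id):
    for memory_text in extract_memories(conversation_text, user_id):
        if is_duplicate(memory_store, memory_text):
            continue  # Skip near-duplicates instead of storing them again
        memory_store.store(
            text=memory_text,
            metadata={
                "user_id": user_id,
                "source": "conversation",
                "timestamp": time.time()
            }
        )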
Retrieval quality also improves dramatically with cognitive scoring. Instead of ranking by vector similarity alone, cognitive scoring factors in how recently a memory was accessed (recency), how often it has been retrieved (frequency), how it connects to other relevant memories through entity relationships (spreading activation), and how well corroborated it is (confidence). This is the approach that Adaptive Recall uses, based on the ACT-R cognitive architecture from decades of memory research.
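Adaptive Recall's exact scoring is its own, but the general shape is easy to sketch: blend vector similarity with recency, frequency, and confidence signals kept in each memory's metadata. The weights, the decay constant, and the metadata fields below (last_accessed, retrieval_count, confidence) are illustrative assumptions rather than the real formula, and spreading activation over entity links is omitted for brevity:

import math
import time

def cognitive_score(memory, similarity, now=None,
                    w_sim=0.5, w_recency=0.2, w_freq=0.2, w_conf=0.1):
    """Blend vector similarity with usage-based signals (illustrative weights)."""
    now = now or time.time()
    meta = memory["metadata"]

    # Recency: decay with time since the memory was last accessed
    age_days = (now - meta.get("last_accessed", meta["timestamp"])) / 86400
    recency = math.exp(-age_days / 30)  # roughly a monthly decay, an arbitrary choice

    # Frequency: diminishing returns on repeated retrievals
    frequency = math.log1p(meta.get("retrieval_count", 0)) / math.log1p(100)

    # Confidence: how well corroborated the memory is (0.0 to 1.0)
    confidence = meta.get("confidence", 0.5)

    return (w_sim * similarity + w_recency * recency
            + w_freq * frequency + w_conf * confidence)

Candidates returned by the vector search would then be re-ranked by cognitive_score instead of raw similarity before being injected into the prompt.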
Skip building memory infrastructure from scratch. Adaptive Recall handles extraction, storage, cognitive retrieval, and lifecycle management through a single API.
Get Started Free