How to Add Persistent Memory to Any LLM App
Before You Start
You need an existing application that calls an LLM API (OpenAI, Anthropic, or any provider). You also need a storage backend for memories. This guide uses a vector database for storage, but you can substitute any searchable store. If you want to skip building the infrastructure yourself, Adaptive Recall provides all three layers (extraction, storage, and retrieval) as a managed service through MCP or a REST API.
The approach here is provider-agnostic. It works with any LLM that accepts a system message, which is every major model available in 2026. The memory layer sits between your application and the LLM, intercepting conversations to extract memories and enriching prompts with retrieved context.
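In code, that interception is just a wrapper around each conversation turn: look up relevant memories, call the model with them in context, then store anything new the turn produced. Here is a minimal sketch of that loop, assuming a hypothetical handle_turn wrapper and the chat_with_memory and save_memories helpers built step by step in the next section:

def handle_turn(user_message, user_id, base_prompt):
    # Retrieval + generation: chat_with_memory (defined below) pulls relevant
    # memories and folds them into the system message before calling the LLM.
    reply = chat_with_memory(user_message, user_id, base_prompt)

    # Extraction + storage: save anything from this turn worth remembering.
    save_memories(f"User: {user_message}\nAssistant: {reply}", user_id)

    return reply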
Step-by-Step Implementation
Your memories need a home. Vector databases are the standard choice because they support semantic search, letting you find memories by meaning rather than exact text match. Popular options include pgvector (if you already run PostgreSQL), Pinecone (managed service), Qdrant (open source), and Weaviate (open source with cloud option).
For this guide, we use a simple in-process vector store to keep the focus on the memory logic. In production, replace this with your chosen database. The interface is the same: embed, store, search.
import time

import numpy as np
from openai import OpenAI

client = OpenAI()

class MemoryStore:
    def __init__(self):
        self.memories = []
        self.vectors = []

    def embed(self, text):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def store(self, text, metadata=None):
        vector = self.embed(text)
        self.memories.append({
            "text": text,
            "metadata": metadata or {},
            "created_at": time.time()
        })
        self.vectors.append(vector)

    def search(self, query, top_k=5):
        query_vec = self.embed(query)
        # Cosine similarity between the query and every stored memory
        scores = [
            np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
            for v in self.vectors
        ]
        ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        return [self.memories[i] for i, _ in ranked[:top_k]]

After each conversation turn, extract noteworthy information to store as memories. The most effective approach uses the LLM itself for extraction. Send the conversation to the model with instructions to identify facts, preferences, decisions, and observations worth remembering.
def extract_memories(conversation_text, user_id):
    extraction_prompt = """Review this conversation and extract information
worth remembering for future interactions. Focus on:
- User preferences and stated requirements
- Factual information about their project or situation
- Decisions that were made and their reasoning
- Technical details that would be useful later
Return each memory as a separate line. Return NONE if nothing
is worth storing."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": extraction_prompt},
            {"role": "user", "content": conversation_text}
        ],
        temperature=0.2
    )
    memories_text = response.choices[0].message.content
    if memories_text.strip() == "NONE":
        return []
    return [
        line.strip()
        for line in memories_text.split("\n")
        if line.strip()
    ]

Each extracted memory gets embedded as a vector and stored with metadata including the user ID, timestamp, and source context. The metadata enables filtering during retrieval (only show memories for this user, prefer recent memories).
import time

memory_store = MemoryStore()

def save_memories(conversation_text, user_id):
    extracted = extract_memories(conversation_text, user_id)
    for memory_text in extracted:
        memory_store.store(
            text=memory_text,
            metadata={
                "user_id": user_id,
                "source": "conversation",
                "timestamp": time.time()
            }
        )

At the start of each session or before each model call, retrieve memories relevant to the current context. The query can be the user's latest message, a summary of the conversation so far, or a combination of both.
def retrieve_context(query, user_id, max_memories=5):
    results = memory_store.search(query, top_k=max_memories)
    # Filter to only this user's memories
    user_memories = [
        m for m in results
        if m["metadata"].get("user_id") == user_id
    ]
    return user_memories

Format the retrieved memories and add them to the system message. The model reads this context and incorporates it into its responses naturally. Include timestamps so the model can assess recency.
def build_system_message(base_prompt, memories):
    if not memories:
        return base_prompt
    memory_block = "\n\nRelevant context from previous interactions:\n"
    for m in memories:
        age = time.time() - m["metadata"]["timestamp"]
        days_ago = int(age / 86400)
        time_label = f"{days_ago} days ago" if days_ago > 0 else "today"
        memory_block += f"- ({time_label}) {m['text']}\n"
    return base_prompt + memory_block

def chat_with_memory(user_message, user_id, base_prompt):
    # Retrieve relevant memories
    memories = retrieve_context(user_message, user_id)
    system_msg = build_system_message(base_prompt, memories)
    # Call the LLM with memory-enriched context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_message}
        ]
    )
    return response.choices[0].message.content

Verify that the full loop works: store some memories in one session, then start a new session and confirm the model uses the stored context. Test with specific facts that the model would not know without memory access.
# Session 1: Store information
save_memories(
    "User said they are building a React app with PostgreSQL backend "
    "and prefer TypeScript. They deploy to AWS ECS.",
    user_id="user_123"
)

# Session 2: New conversation, memory should be used
response = chat_with_memory(
    "What database should I use for this new feature?",
    user_id="user_123",
    base_prompt="You are a helpful coding assistant."
)
# Model should reference PostgreSQL from stored memory

Going Beyond Basic Memory
The implementation above handles the core memory loop, but production systems need additional capabilities. Deduplication prevents storing the same fact multiple times. Contradiction detection identifies when new information conflicts with stored memories. Consolidation merges related memories into more compact, useful summaries. Confidence scoring tracks which memories have been validated by repeated use versus those based on a single mention.
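As one illustration, deduplication can reuse the embeddings you already compute: before writing a new memory, check whether an existing one is nearly identical and skip the write if so. Here is a minimal sketch against the MemoryStore above; the is_duplicate helper and the 0.9 similarity threshold are illustrative choices, not part of the implementation shown earlier:

import time

import numpy as np

def is_duplicate(store, candidate_text, threshold=0.9):
    """Return True if an existing memory is nearly identical to candidate_text."""
    if not store.vectors:
        return False
    candidate_vec = store.embed(candidate_text)
    for v in store.vectors:
        similarity = np.dot(candidate_vec, v) / (
            np.linalg.norm(candidate_vec) * np.linalg.norm(v)
        )
        if similarity >= threshold:
            return True
    return False

def save_memories_deduplicated(conversation_text, user_id):
    for memory_text in extract_memories(conversation_text, user_id):
        if is_duplicate(memory_store, memory_text):
            continue  # Skip near-duplicates instead of storing them again
        memory_store.store(
            text=memory_text,
            metadata={
                "user_id": user_id,
                "source": "conversation",
                "timestamp": time.time()
            }
        )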
Retrieval quality also improves dramatically with cognitive scoring. Instead of ranking by vector similarity alone, cognitive scoring factors in how recently a memory was accessed (recency), how often it has been retrieved (frequency), how it connects to other relevant memories through entity relationships (spreading activation), and how well corroborated it is (confidence). This is the approach that Adaptive Recall uses, based on the ACT-R cognitive architecture from decades of memory research.
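Adaptive Recall's exact scoring is its own, but the general shape is easy to sketch: blend vector similarity with recency, frequency, and confidence signals kept in each memory's metadata. The weights, the decay constant, and the metadata fields below (last_accessed, retrieval_count, confidence) are illustrative assumptions rather than the real formula, and spreading activation over entity links is omitted for brevity:

import math
import time

def cognitive_score(memory, similarity, now=None,
                    w_sim=0.5, w_recency=0.2, w_freq=0.2, w_conf=0.1):
    """Blend vector similarity with usage-based signals (illustrative weights)."""
    now = now or time.time()
    meta = memory["metadata"]

    # Recency: decay with time since the memory was last accessed
    age_days = (now - meta.get("last_accessed", meta["timestamp"])) / 86400
    recency = math.exp(-age_days / 30)  # roughly a monthly decay, an arbitrary choice

    # Frequency: diminishing returns on repeated retrievals
    frequency = math.log1p(meta.get("retrieval_count", 0)) / math.log1p(100)

    # Confidence: how well corroborated the memory is (0.0 to 1.0)
    confidence = meta.get("confidence", 0.5)

    return (w_sim * similarity + w_recency * recency
            + w_freq * frequency + w_conf * confidence)

Candidates returned by the vector search would then be re-ranked by cognitive_score instead of raw similarity before being injected into the prompt.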
Skip building memory infrastructure from scratch. Adaptive Recall handles extraction, storage, cognitive retrieval, and lifecycle management through a single API.
Get Started Free