How to Build an AI Assistant with Persistent Memory
Before You Start
You need a working AI assistant or the scaffolding for one: a model provider account (Anthropic, OpenAI, or equivalent), a basic request-response loop, and an understanding of how your assistant's conversations flow. This guide adds memory to that foundation. If you are starting completely from scratch, read the AI assistant architecture overview first to understand the full component stack.
You also need to decide on a memory backend. You can build your own using a vector database and custom extraction logic, or you can use a memory service like Adaptive Recall that provides storage, retrieval, lifecycle management, and cognitive scoring through a single API. This guide covers both approaches, but the managed service path is significantly faster to production readiness.
Step-by-Step Setup
Step 1: Decide What to Remember
Before writing any code, decide what your assistant should remember. Different types of information serve different purposes and need different storage and retrieval strategies. User preferences (language, tone, formatting preferences, timezone) should be stored as persistent profile data that is loaded into every conversation. Factual knowledge (project details, team members, technology stack, business rules) should be stored as discrete facts with confidence scores. Task history (what the user asked for, what the assistant did, what the outcome was) should be stored as episodic memories with timestamps. Conversation summaries (condensed versions of previous sessions) provide continuity without consuming excessive context tokens.
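To make these categories concrete, here is one possible shape for a memory record. The field names below are illustrative, not a schema required by any particular backend.

# Illustrative memory record; the fields are one possible schema
from typing import Literal, TypedDict

class MemoryRecord(TypedDict):
    type: Literal["preference", "fact", "episode", "summary"]
    content: str           # "User prefers TypeScript", "API uses OAuth2"
    confidence: float      # 0.0-1.0; higher for facts stated directly
    timestamp: float       # Unix time; essential for episodic memories
    entities: list[str]    # e.g. ["OAuth2", "FastAPI"], for graph links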
Step 2: Connect or Build Your Memory Backend
If you are using Adaptive Recall, this step amounts to connecting the service through MCP or REST. Add your server URL and API key to your application configuration. The service provides all seven memory operations (store, recall, update, forget, reflect, graph, status) out of the box, including entity extraction, confidence scoring, and knowledge graph construction.
If you are building your own memory backend, you need at minimum: a vector database for semantic search (Pinecone, Qdrant, Weaviate, or pgvector), an embedding model for converting text to vectors (OpenAI text-embedding-3-small or Voyage AI), a metadata store for timestamps, confidence scores, and user associations, and extraction logic for identifying important facts in conversation text.
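If you want to see the moving parts, here is a minimal sketch of that do-it-yourself path: it embeds text with OpenAI's text-embedding-3-small and ranks by cosine similarity. The in-memory list and the MemoryStore name are stand-ins; a production system would use one of the vector databases above for storage and search.

# Minimal DIY sketch: embed, store with metadata, rank by cosine similarity.
# An in-memory list stands in for a real vector database.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

class MemoryStore:
    def __init__(self):
        self.records = []  # swap for Pinecone/Qdrant/Weaviate/pgvector

    def store(self, content, **metadata):
        self.records.append({"content": content, "vector": embed(content),
                             "timestamp": time.time(), **metadata})

    def recall(self, query, limit=10):
        qvec = embed(query)
        ranked = sorted(self.records,
                        key=lambda r: cosine(qvec, r["vector"]), reverse=True)
        return ranked[:limit]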
# Example: Adaptive Recall MCP configuration
{
  "mcpServers": {
    "adaptive-recall": {
      "type": "url",
      "url": "https://your-instance.adaptiverecall.com/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_API_KEY"
      }
    }
  }
}

Step 3: Extract and Store Memories
After each conversation turn, your assistant needs to identify what is worth remembering and store it. This can happen synchronously (blocking the response until storage completes) or asynchronously (storing in the background after the response is sent). Asynchronous extraction is better for user experience because it does not add latency to responses.
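A common pattern is to send the response first and then schedule extraction as a background task. The sketch below uses asyncio and assumes the extract_and_store_memories coroutine defined in the next example; handle_turn is a hypothetical name for your request handler.

# Sketch: fire-and-forget extraction so storage never delays the reply
import asyncio

async def handle_turn(user_message, assistant_reply, memory_client, model):
    turn_text = f"User: {user_message}\nAssistant: {assistant_reply}"
    # Keep a reference so the task is not garbage-collected mid-flight
    task = asyncio.create_task(
        extract_and_store_memories(turn_text, memory_client, model)
    )
    return task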
The extraction process should identify: new facts the user stated ("our API uses OAuth2," "the deadline is March 15"), preferences the user expressed ("I prefer TypeScript," "keep responses concise"), decisions made during the conversation ("we decided to use PostgreSQL"), and corrections to previously stored information ("actually, the deadline moved to April 1"). Each extracted item becomes a discrete memory unit with metadata including the source conversation, timestamp, confidence level, and associated entities.
# Python example: memory extraction after assistant response
import json

async def extract_and_store_memories(conversation_turn, memory_client, model):
    # `model` is your LLM client wrapper (any provider SDK will do)
    extraction_prompt = """Review this conversation turn and identify
any facts, preferences, or decisions worth remembering.
Return a JSON array of memory objects with 'content' and 'type' fields.
Only include information that would be useful in future conversations.
Return an empty array if nothing is worth storing."""

    extraction = await model.generate(
        system=extraction_prompt,
        messages=[{"role": "user", "content": conversation_turn}]
    )

    # Models occasionally return malformed JSON; skip storage rather
    # than crash the background task
    try:
        memories = json.loads(extraction.text)
    except json.JSONDecodeError:
        return

    # Each extracted item becomes a discrete memory unit with metadata
    for memory in memories:
        await memory_client.store(
            content=memory["content"],
            metadata={"type": memory["type"], "source": "conversation"}
        )

Step 4: Retrieve and Inject Relevant Memories
Before each model call, query the memory store for context relevant to the current conversation. The query should use the user's latest message (and optionally recent conversation context) as the search input. Retrieved memories are injected into the system prompt or as a separate context block that the model can reference when generating its response.
The number of memories to retrieve depends on your context budget. Retrieving too many memories wastes tokens on irrelevant information and can confuse the model. Retrieving too few risks missing important context. A good starting point is 5 to 15 memories per query, ranked by relevance. Cognitive scoring (which weights recency, access frequency, confidence, and entity connections alongside semantic similarity) produces better rankings than pure vector similarity because it prioritizes memories that are both relevant and current.
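As an illustration of how those signals can combine, the sketch below blends them as a weighted sum. The weights, field names, and normalization are assumptions for illustration, not Adaptive Recall's actual formula.

# Illustrative cognitive score; weights and fields are assumptions
import math
import time

def cognitive_score(mem, similarity, half_life_days=30.0):
    # Recency: exponential decay since the memory was last accessed
    age_days = (time.time() - mem["last_accessed"]) / 86400
    recency = 0.5 ** (age_days / half_life_days)
    # Frequency and connectivity: log-scaled so outliers don't dominate
    frequency = min(math.log1p(mem["access_count"]) / math.log1p(50), 1.0)
    connectivity = min(math.log1p(len(mem["entities"])) / math.log1p(10), 1.0)
    return (0.45 * similarity + 0.20 * recency + 0.15 * mem["confidence"]
            + 0.10 * frequency + 0.10 * connectivity)

Memories are then ranked by this blended score rather than by raw similarity before the top N are injected into context.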
# Python example: context assembly with memory retrieval
async def build_context(user_message, user_id, memory_client):
    # Retrieve the memories most relevant to the user's latest message
    relevant_memories = await memory_client.recall(
        query=user_message,
        user_id=user_id,
        limit=10
    )

    # Render retrieved memories as a context block the model can reference
    memory_context = "## Relevant Context from Previous Conversations\n"
    for mem in relevant_memories:
        memory_context += f"- {mem['content']} (confidence: {mem['confidence']})\n"

    # BASE_SYSTEM_PROMPT is your assistant's existing system prompt
    system_prompt = BASE_SYSTEM_PROMPT + "\n\n" + memory_context
    return system_prompt

Step 5: Manage the Memory Lifecycle
Memories are not static. Facts change, preferences evolve, and old information becomes irrelevant. Your memory system needs mechanisms for updating existing memories (when a user corrects or updates previously stored information), consolidating related memories (merging fragments into comprehensive knowledge), and removing outdated or contradicted memories.
Adaptive Recall handles lifecycle management through its reflect tool (consolidation), update tool (modifications), and forget tool (removal). If you are building your own system, you need scheduled consolidation runs that identify related memories and merge them, contradiction detection that flags conflicting information for resolution, and decay mechanisms that gradually reduce the weight of memories that have not been accessed or corroborated.
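As a sketch of the decay piece for a do-it-yourself system (the weight, last_accessed, and corroborations fields are hypothetical):

# Sketch: decay and prune memories that are neither accessed nor corroborated
import time

DECAY_HALF_LIFE_DAYS = 60.0
PRUNE_THRESHOLD = 0.05

def decayed_weight(memory):
    # Halve a memory's weight every half-life without access;
    # corroborated memories decay more slowly
    age_days = (time.time() - memory["last_accessed"]) / 86400
    half_life = DECAY_HALF_LIFE_DAYS * (1 + memory.get("corroborations", 0))
    return memory["weight"] * 0.5 ** (age_days / half_life)

def prune(memories):
    return [m for m in memories if decayed_weight(m) >= PRUNE_THRESHOLD]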
Step 6: Test Memory Behavior
Memory bugs are subtle and difficult to catch with unit tests alone. You need end-to-end tests that simulate multi-session conversations and verify that memories persist, retrieve accurately, and update correctly. Create test scenarios that cover: storing a fact in session one and retrieving it in session two, updating a fact and verifying the old value is replaced, storing contradictory facts and verifying that consolidation resolves them, and verifying that irrelevant memories do not surface for unrelated queries.
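Here is a sketch of the first scenario as an end-to-end test. It assumes a memory_client fixture wired to a test instance, a single test user, and pytest-asyncio for the async test function.

# Sketch: fact stored in session one is retrievable in session two
import pytest

@pytest.mark.asyncio
async def test_fact_persists_across_sessions(memory_client):
    # Session one: store a fact
    await memory_client.store(
        content="The project deadline is March 15",
        metadata={"type": "fact", "source": "conversation"}
    )
    # Session two: a fresh query should surface the stored fact
    results = await memory_client.recall(
        query="When is the project deadline?",
        user_id="test-user",
        limit=5
    )
    assert any("March 15" in m["content"] for m in results)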
Common Pitfalls
The most common mistake is storing too much. If the assistant remembers every detail of every conversation, retrieval quality degrades because relevant memories are buried under noise. Be selective about what you extract: a memory should be something that would genuinely help the assistant in a future conversation. "The user asked about authentication" is too vague. "The user's project uses FastAPI with OAuth2PasswordBearer for authentication" is useful and specific.
The second most common mistake is treating all memories equally. A fact the user stated directly ("we use PostgreSQL 15") should carry higher confidence than something the assistant inferred ("based on the project structure, this appears to be a Django project"). Confidence scoring ensures that well-corroborated facts are weighted more heavily in retrieval than uncertain observations, which reduces the risk of the assistant acting on wrong information.
The third pitfall is neglecting memory updates. If the user tells the assistant in January that the deadline is March 15, and then tells it in February that the deadline moved to April 1, the memory system needs to update the deadline memory rather than storing a second, contradictory one. Without update logic, the assistant may retrieve either memory randomly, producing inconsistent responses.
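One way to implement that update logic is an upsert: before storing a new fact, check whether an existing memory covers the same ground and update it instead. The similarity threshold, the update signature, and the similarity field below are assumptions for illustration, not a specific API.

# Sketch: update-or-insert so corrections replace stale facts
async def upsert_memory(content, user_id, memory_client, threshold=0.85):
    matches = await memory_client.recall(query=content, user_id=user_id, limit=1)
    if matches and matches[0]["similarity"] >= threshold:
        # Replace the stale value instead of storing a contradictory twin
        await memory_client.update(memory_id=matches[0]["id"], content=content)
    else:
        await memory_client.store(
            content=content,
            metadata={"type": "fact", "source": "conversation"}
        )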
Add persistent memory to your AI assistant in minutes. Adaptive Recall handles storage, retrieval, lifecycle management, and cognitive scoring so you can focus on your assistant's core functionality.
Get Started Free