Memory for AI Agents

AI agents that run autonomously, orchestrate tools, and collaborate in multi-agent systems need memory that persists across sessions, survives restarts, and remains consistent when multiple agents read and write simultaneously. Without a dedicated memory layer, agents lose 15 to 30% of context on long-running tasks and repeat the same mistakes across sessions because every invocation starts from zero.

Why Agents Need Memory

The difference between a chatbot and an AI agent is autonomy. A chatbot responds to a single user message and waits for the next one. An agent takes a goal, decomposes it into tasks, executes those tasks over time, monitors results, and adapts its approach based on what it learns. This autonomy creates a memory requirement that conversation history alone cannot satisfy.

Conversation history is a log of what was said. Agent memory is a structured record of what was learned, what was tried, what worked, what failed, what the current state of each task is, and what relationships exist between entities encountered during execution. An agent investigating a production incident needs to remember which services it already checked, what metrics it found anomalous, which hypotheses it ruled out, and what dependencies connect the affected services. Stuffing this into a conversation context window works for a few minutes of investigation. It fails on tasks that span hours or days because the context window fills with intermediate steps that obscure the high-level state.

The problem compounds in multi-agent systems. When agents collaborate on a task, they need shared memory that all agents can read from and write to. Agent A discovers that the database connection pool is saturated. Agent B needs to know this before it investigates the application layer, or it will waste time rediscovering the same fact. Without shared memory, multi-agent systems degrade into parallel independent investigations that duplicate work and miss connections between findings.

Results on the LOCOMO benchmark show that agents without persistent memory lose accuracy on tasks that depend on information from more than two sessions earlier. They cannot recall specific details from earlier interactions, confuse the temporal ordering of events, and fail to maintain consistency in long-running workflows. Memory transforms agents from stateless functions that happen to use LLMs into systems that accumulate knowledge and improve over time.

The failure modes are specific and measurable. An agent without memory repeats actions it already performed ("let me check the database logs" when it already checked them an hour ago and found nothing). It makes contradictory decisions across sessions because it cannot recall its earlier reasoning. It asks users for information that it was already given in a previous conversation. It fails to learn from mistakes because it has no record that a particular approach was tried and failed. Each of these failures erodes user trust and wastes time, and all of them are eliminated by adding a persistent memory layer that the agent can write to and read from across sessions.

Working Memory vs Long-Term Memory

Agent memory systems mirror the distinction between working memory and long-term memory in human cognition, and getting this split right is critical for building agents that perform well both on immediate tasks and on work that spans days or weeks.

Working memory is the active context that an agent uses during a single task execution. It includes the current goal, the plan for achieving it, intermediate results, the current step, and recently discovered information that has not yet been validated or integrated into long-term storage. Working memory is fast, small, and volatile. It maps naturally to the LLM's context window plus any scratchpad or state variables maintained during a single agent loop. When the agent completes a task or the session ends, working memory should be selectively promoted to long-term storage: the conclusions persist, but the intermediate reasoning steps typically do not.

Long-term memory is the persistent store of validated knowledge that the agent has accumulated across all sessions. It includes facts about the environment (which services depend on which databases, what the deploy process looks like, who owns each component), learned procedures (the sequence of steps that successfully diagnosed a connection pool issue last time), and episodic records (specific incidents, their causes, and their resolutions). Long-term memory is slower to access than working memory because it requires a retrieval step, but it is persistent and potentially much larger.

The promotion process between working and long-term memory is where most agent systems fall short. Naive approaches dump everything from the conversation history into long-term storage, which fills the memory with low-value intermediate steps and dilutes retrieval quality. Effective approaches are selective: an agent finishes a task and stores the outcome, key decisions, and any surprising discoveries, but discards the step-by-step reasoning that led there. This is analogous to how human memory consolidation works during sleep, retaining the important patterns and discarding the irrelevant details.
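
The selective promotion described above can be sketched in a few lines. This is an illustrative example, not a prescribed implementation: the `WorkingMemory`, `LongTermStore`, and `promote` names, the confidence values, and the "surprising" flag are all assumptions made for the sketch.

```python
import time
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    goal: str
    steps: list = field(default_factory=list)      # intermediate reasoning, discarded
    findings: list = field(default_factory=list)   # candidate facts for promotion

@dataclass
class LongTermStore:
    records: list = field(default_factory=list)

    def store(self, text, kind, confidence):
        self.records.append({
            "text": text, "kind": kind,
            "confidence": confidence, "stored_at": time.time(),
        })

def promote(working: WorkingMemory, store: LongTermStore, outcome: str):
    """Keep the conclusion and notable findings; drop intermediate steps."""
    store.store(outcome, kind="outcome", confidence=0.9)
    for f in working.findings:
        if f["surprising"] or f["confidence"] >= 0.7:
            store.store(f["text"], kind="finding", confidence=f["confidence"])
    working.steps.clear()  # step-by-step reasoning is never persisted

wm = WorkingMemory(goal="diagnose API latency")
wm.steps = ["checked logs", "ran EXPLAIN", "compared metrics"]
wm.findings = [
    {"text": "connection pool saturated at peak", "confidence": 0.85, "surprising": True},
    {"text": "tried restarting worker, no effect", "confidence": 0.4, "surprising": False},
]
lts = LongTermStore()
promote(wm, lts, outcome="root cause: undersized connection pool")
```

The key design choice is that the filter runs at task completion, not continuously: only conclusions and well-supported findings cross the boundary, which keeps the long-term store free of reasoning noise.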

Adaptive Recall handles this distinction through its memory lifecycle. Its store tool writes observations with metadata about source, confidence, and context. The consolidation process periodically reviews related memories, merges redundant ones, strengthens well-corroborated ones, and lets unverified observations fade. This means agent developers do not need to build their own promotion logic; the memory system handles the transition from working observations to consolidated long-term knowledge automatically.

Multi-Agent Memory Sharing

Multi-agent architectures introduce memory challenges that do not exist for single agents. When multiple agents operate on the same knowledge base, you need patterns for sharing, isolation, and conflict resolution that prevent agents from interfering with each other while still allowing them to benefit from each other's discoveries.

Shared Memory Bus

The simplest sharing pattern gives all agents read and write access to a single memory store. Agent A stores a finding, Agent B retrieves it on its next recall. This works well when agents have complementary roles (one researches, one acts) and there is low risk of conflicting writes. The downside is that poorly structured observations from one agent can pollute the retrieval results for all other agents. If Agent A stores verbose, low-confidence notes, those notes show up when Agent B searches for related information.

Scoped Namespaces

A more structured pattern gives each agent its own namespace for writing while allowing read access to other namespaces. Agent A writes to the "research" namespace, Agent B writes to the "execution" namespace, and both can read from both. This prevents write conflicts while preserving the ability to share knowledge. A coordinator agent can also maintain a "shared" namespace that contains validated, high-confidence information promoted from individual agent namespaces.
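
A minimal sketch of the namespace pattern, assuming a simple in-process store (the `NamespacedMemory` class and its methods are hypothetical names for the example):

```python
class NamespacedMemory:
    def __init__(self):
        self._spaces = {}  # namespace -> list of entries

    def write(self, agent_ns, text):
        """Each agent writes only into its own namespace."""
        self._spaces.setdefault(agent_ns, []).append(text)

    def read(self, namespaces=None, contains=None):
        """Read across one or more namespaces, optionally filtering by substring."""
        spaces = namespaces or list(self._spaces)
        results = []
        for ns in spaces:
            for entry in self._spaces.get(ns, []):
                if contains is None or contains in entry:
                    results.append((ns, entry))
        return results

mem = NamespacedMemory()
mem.write("research", "database connection pool saturated")
mem.write("execution", "restarted worker pods, no improvement")
mem.write("shared", "incident INC-1 opened for API latency")

# Agent B reads both its own namespace and the research namespace.
hits = mem.read(namespaces=["research", "execution"], contains="pool")
```

Because writes are confined to one namespace per agent, no cross-agent write conflict is possible; coordination happens entirely at read time.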

Blackboard Architecture

The blackboard pattern, borrowed from classical AI, uses a shared memory space where agents post observations and read each other's posts. A control component decides which agent runs next based on the current state of the blackboard. This is effective for problems where agents have different expertise and the solution requires combining their contributions. The memory system acts as the coordination mechanism rather than direct agent-to-agent communication.
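
A toy version of the blackboard control loop, under the assumption that each agent exposes a trigger predicate and an action (both names are illustrative):

```python
def run_blackboard(board, agents, max_rounds=10):
    """Run agents until none can contribute or the round limit is hit."""
    for _ in range(max_rounds):
        # The controller picks the first agent whose trigger matches the board.
        runnable = [a for a in agents if a["can_run"](board)]
        if not runnable:
            break
        runnable[0]["run"](board)
    return board

agents = [
    {   # posts a hypothesis once latency is reported
        "can_run": lambda b: any("high latency" in e for e in b)
                             and not any("hypothesis" in e for e in b),
        "run": lambda b: b.append("hypothesis: slow database query"),
    },
    {   # verifies the hypothesis once it appears on the board
        "can_run": lambda b: any("hypothesis" in e for e in b)
                             and not any("verified" in e for e in b),
        "run": lambda b: b.append("verified: slow query found in logs"),
    },
]
board = run_blackboard(["high latency on /checkout"], agents)
```

Note that the agents never call each other: the board's contents alone determine who runs next, which is what makes the blackboard the coordination mechanism.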

Event-Driven Memory Updates

In more sophisticated multi-agent systems, memory updates trigger events that other agents can subscribe to. When Agent A stores a finding about high database latency, a memory event fires that wakes up Agent B (the infrastructure monitor) to check network conditions and Agent C (the application profiler) to look for slow queries. The memory system acts as both the knowledge store and the coordination bus, with new memories triggering downstream investigation rather than requiring a central orchestrator to dispatch tasks.

This pattern reduces coordination overhead because agents respond to relevant new knowledge rather than polling for updates. It also reduces wasted work because agents see each other's findings before starting their own investigation. The trade-off is complexity: event-driven architectures require careful design to avoid cascading loops where Agent A's finding triggers Agent B, whose finding triggers Agent A, and so on. Circuit breakers and cooldown periods on memory-triggered actions prevent these loops.
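
The cooldown idea can be sketched as a small event bus that suppresses handlers which fired too recently. The `MemoryEventBus` API below is an assumption made for illustration, not a real library:

```python
import time

class MemoryEventBus:
    def __init__(self, cooldown_seconds=60.0):
        self._subscribers = []   # (topic, handler) pairs
        self._last_fired = {}    # (topic, handler id) -> timestamp
        self.cooldown = cooldown_seconds

    def subscribe(self, topic, handler):
        self._subscribers.append((topic, handler))

    def publish(self, topic, memory, now=None):
        """Notify subscribers; skip any handler still in its cooldown window."""
        now = now if now is not None else time.monotonic()
        fired = []
        for sub_topic, handler in self._subscribers:
            if sub_topic != topic:
                continue
            key = (sub_topic, id(handler))
            last = self._last_fired.get(key)
            if last is not None and now - last < self.cooldown:
                continue  # circuit breaker: this handler fired too recently
            self._last_fired[key] = now
            handler(memory)
            fired.append(handler)
        return fired

bus = MemoryEventBus(cooldown_seconds=60.0)
log = []
bus.subscribe("db.latency", lambda m: log.append(f"infra agent checking: {m}"))

bus.publish("db.latency", "p99 latency 900ms", now=0.0)    # fires
bus.publish("db.latency", "p99 latency 950ms", now=10.0)   # suppressed by cooldown
bus.publish("db.latency", "p99 latency 400ms", now=120.0)  # fires again
```

The cooldown is per-handler rather than global, so one noisy topic cannot silence unrelated subscriptions, but a single agent cannot be re-triggered in a tight loop either.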

Adaptive Recall supports these patterns through its tagging and metadata system. Agents can tag memories with their agent ID, role, confidence level, and task context. Recall queries can filter by tags, so an agent can search only its own memories, only shared memories, or all memories with weighted preferences. The knowledge graph connects entities across agent boundaries, so Agent A's discovery about a database connection appears when Agent B searches for information about the application that depends on that database, even though Agent B never directly accessed Agent A's namespace.

State Persistence and Recovery

Agents crash. Services restart. Deployments roll. Any agent system that runs for more than a few minutes will eventually face an interruption that terminates execution mid-task. Without state persistence, the agent restarts from zero and either repeats all previous work or, worse, takes a different path that contradicts actions already taken.

Checkpointing is the standard pattern for state persistence in agent systems. At each significant step, the agent writes a checkpoint that captures: the current goal, the plan, which steps have been completed, intermediate results, and the current working memory state. If the agent is interrupted, it restores from the last checkpoint and continues from that point rather than restarting. The frequency of checkpointing trades off between recovery granularity and performance overhead. Checkpointing after every tool call provides fine-grained recovery but adds latency to every operation. Checkpointing at task boundaries (when a sub-task completes) provides coarser recovery but lower overhead.
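
Task-boundary checkpointing can be sketched as follows. The file layout and function names are illustrative assumptions; the one detail worth copying is the atomic write, so a crash mid-checkpoint never corrupts the last good state:

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    # Write to a temp file and rename, so a crash mid-write never
    # leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return {"goal": None, "completed": [], "results": {}}
    with open(path) as f:
        return json.load(f)

def run_agent(path, goal, subtasks):
    state = load_checkpoint(path)
    state["goal"] = goal
    for name, fn in subtasks:
        if name in state["completed"]:
            continue  # finished before the last interruption; skip it
        state["results"][name] = fn()
        state["completed"].append(name)
        save_checkpoint(path, state)  # checkpoint at the task boundary
    return state

path = os.path.join(tempfile.mkdtemp(), "agent.ckpt.json")
subtasks = [("check_logs", lambda: "no errors"),
            ("check_metrics", lambda: "latency spike")]
first = run_agent(path, "diagnose latency", subtasks)
resumed = run_agent(path, "diagnose latency", subtasks)  # re-runs nothing
```

Because only factual state (completed steps and their results) is checkpointed, the resumed run skips finished sub-tasks instead of replaying a conversation.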

The challenge is that LLM-based agents do not have deterministic state in the traditional sense. The "state" of an agent includes the conversation context, which may include stochastic elements that produce different continuations even from the same checkpoint. Effective agent checkpointing stores the factual state (completed steps, results, decisions) rather than the full conversation history, and reconstructs a fresh context from the checkpoint data when resuming. This approach is more reliable because the resumed agent makes decisions based on facts rather than on a replayed conversation that may lead to different reasoning paths.

Durable execution frameworks like Temporal provide infrastructure for agent persistence at the workflow level. The workflow engine persists each step's input and output, so if the agent process dies, the framework restarts it and replays the completed steps. This is powerful for multi-step agent workflows but requires integrating the agent's execution model with the workflow engine's step-based model.

An alternative to checkpointing is event sourcing: instead of saving snapshots of the agent's state, log every action and observation as an immutable event. To recover, replay the event log to reconstruct the current state. Event sourcing has the advantage of a complete audit trail (you can see exactly what the agent did and in what order) and the ability to reconstruct the state at any past point in time. The disadvantage is that replay time grows linearly with the number of events, so long-running agents with thousands of events need periodic snapshot compaction to keep recovery time reasonable.
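
A minimal event-sourcing sketch with snapshot compaction, assuming a simple key-value state model (all names here are illustrative):

```python
class EventLog:
    def __init__(self, snapshot_every=100):
        self.events = []
        self.snapshot = None          # (event_index, state_copy)
        self.snapshot_every = snapshot_every

    def append(self, event):
        self.events.append(event)
        if len(self.events) % self.snapshot_every == 0:
            self.snapshot = (len(self.events), self.replay())

    def replay(self):
        """Rebuild current state, starting from the last snapshot if present."""
        if self.snapshot:
            start, state = self.snapshot
            state = dict(state)       # never mutate the stored snapshot
        else:
            start, state = 0, {}
        for event in self.events[start:]:
            state[event["key"]] = event["value"]  # apply each event in order
        return state

log = EventLog(snapshot_every=3)
log.append({"key": "checked", "value": "logs"})
log.append({"key": "finding", "value": "pool saturated"})
log.append({"key": "status", "value": "investigating"})  # snapshot taken here
log.append({"key": "status", "value": "resolved"})
state = log.replay()  # replays only the one event after the snapshot
```

The full event list is kept for the audit trail, but recovery cost stays bounded: replay only walks events newer than the latest snapshot.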

For most agent architectures, checkpointing at task boundaries is the practical choice. It provides fast recovery (read the last checkpoint, resume from that point) with low overhead (one write per completed sub-task). Event sourcing is worth the additional complexity when you need detailed audit trails for compliance, debugging, or when the agent's decisions have high-stakes consequences that may need post-hoc review.

Handling Memory Conflicts

When multiple agents write to shared memory, conflicts are inevitable. Agent A determines that the API latency issue is caused by a slow database query. Agent B, investigating the same issue from the infrastructure side, determines that the cause is network congestion between availability zones. Both store their findings. When Agent C retrieves context for a summary, it finds two contradictory explanations.

The naive approach is last-write-wins: whichever agent wrote most recently overwrites the previous finding. This is simple but dangerous because it silently discards information that may be correct. A more robust approach is to store both observations with confidence scores and source attribution. When an agent retrieves conflicting memories, it sees both explanations with their evidence and can reason about which is more likely correct, or determine that both factors contribute to the issue.

Adaptive Recall handles conflicts through its contradiction detection and confidence scoring. When a new memory contradicts an existing one, both are preserved with their respective confidence scores. The consolidation process reviews contradictions and adjusts scores based on corroborating evidence. If subsequent investigation confirms Agent A's database query hypothesis (perhaps a third agent finds the slow query in the logs), that memory's confidence increases while the network congestion memory's confidence decreases. Neither memory is deleted; the retrieval ranking naturally favors the better-supported explanation.

For operational safety, agents should also implement optimistic concurrency control when updating shared state. Before writing an update to a memory, check that the memory has not been modified since it was last read. If it has, re-read the current version, merge the changes, and write the merged result. This prevents the lost-update problem where two agents simultaneously update the same memory and one agent's changes are silently overwritten.
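
The check-then-write cycle above is a compare-and-swap on a version number. A sketch, with a hypothetical `VersionedStore` standing in for whatever memory backend is actually used:

```python
class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if the memory was not modified since it was read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            return False  # conflict: caller must re-read and merge
        self._data[key] = (current_version + 1, value)
        return True

def update_with_merge(store, key, new_note, retries=3):
    """Read, merge, and retry on conflict instead of overwriting blindly."""
    for _ in range(retries):
        version, value = store.read(key)
        merged = (value or []) + [new_note]
        if store.write(key, merged, expected_version=version):
            return merged
    raise RuntimeError("too many write conflicts")

store = VersionedStore()
update_with_merge(store, "incident-42", "Agent A: slow query suspected")
update_with_merge(store, "incident-42", "Agent A: query confirmed in logs")

# A stale write (read at version 1, but the store has since moved on) is rejected.
ok = store.write("incident-42", ["stale overwrite"], expected_version=1)
```

The rejected write is the lost-update problem made visible: instead of silently clobbering Agent A's notes, the stale writer is forced to re-read and merge.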

Production Patterns

Production agent memory systems share several patterns that distinguish them from prototype implementations. These patterns address the problems that only emerge after weeks or months of continuous operation: memory bloat, stale knowledge, inconsistent quality across agents, and the difficulty of debugging autonomous systems.

Memory Budgets

Agents that store everything quickly accumulate a memory store full of low-value observations that dilute retrieval quality. An agent that stores 50 intermediate observations per task session produces a memory store where 80% of entries are step-by-step reasoning notes that are rarely useful in future tasks. When a recall query returns 10 results, 8 of them are these intermediate notes and only 2 are the high-value conclusions.

Production systems set memory budgets, either in total memory count or in total token volume, and enforce them through importance-based eviction. When the budget is reached, the lowest-importance memories are archived or deleted. The importance score combines access frequency (memories that are actually recalled are more important), recency (recent observations are more likely relevant), confidence (well-corroborated facts outrank speculative notes), and explicit importance flags set by the agent (the agent can mark certain findings as critical). A budget of 500 to 2,000 active memories per agent handles most production use cases, with consolidation keeping the count bounded by merging related memories.
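
One way to combine those signals into an eviction score is sketched below. The weights, the 30-day recency constant, and the critical-flag bonus are illustrative assumptions, not a prescribed formula:

```python
import math
import time

def importance(memory, now):
    age_days = (now - memory["stored_at"]) / 86400
    recency = math.exp(-age_days / 30)              # decays over roughly a month
    frequency = math.log1p(memory["access_count"])  # diminishing returns on recalls
    score = 0.4 * memory["confidence"] + 0.3 * recency + 0.3 * frequency
    if memory["critical"]:
        score += 1.0                                # agent-flagged findings survive
    return score

def enforce_budget(memories, budget, now=None):
    """Keep the `budget` most important memories; return (kept, evicted)."""
    now = now if now is not None else time.time()
    ranked = sorted(memories, key=lambda m: importance(m, now), reverse=True)
    return ranked[:budget], ranked[budget:]

now = time.time()
memories = [
    {"text": "root cause: pool exhaustion", "confidence": 0.9,
     "access_count": 12, "stored_at": now - 5 * 86400, "critical": True},
    {"text": "step 3: re-ran query", "confidence": 0.3,
     "access_count": 0, "stored_at": now - 40 * 86400, "critical": False},
    {"text": "deploy process uses blue/green", "confidence": 0.8,
     "access_count": 4, "stored_at": now - 10 * 86400, "critical": False},
]
kept, evicted = enforce_budget(memories, budget=2, now=now)
```

Under this scoring, the stale low-confidence reasoning note is the first to go, while the critical finding and the frequently recalled fact stay within budget.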

Temporal Context

Every memory should be stored with a timestamp and, when relevant, a temporal scope. The timestamp records when the observation was made. The temporal scope records how long the observation is expected to remain valid: "this is the current database schema" has a different shelf life than "the team decided to migrate from MySQL to PostgreSQL." Without temporal scope, agents treat a configuration snapshot from six months ago with the same weight as yesterday's finding, leading to outdated information polluting current retrievals.

Retrieval queries should filter by time range when appropriate, and the scoring model should apply recency decay so that recent information outranks older information when both are relevant. For time-sensitive domains (incident response, infrastructure monitoring), aggressive recency weighting ensures the agent works with current data. For stable domains (architecture decisions, business rules), lower recency weighting preserves historically important knowledge that changes slowly.
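
Recency decay is easy to parameterize with a half-life, which makes the domain-specific tuning concrete. The half-life values below are illustrative assumptions:

```python
import math

def score(relevance, age_days, half_life_days):
    """Combine semantic relevance with exponential recency decay."""
    decay = 0.5 ** (age_days / half_life_days)
    return relevance * decay

# Incident response: an aggressive 2-day half-life favors fresh data,
# so a slightly less relevant but fresh memory outranks a stale one.
fresh = score(relevance=0.8, age_days=1, half_life_days=2)
stale = score(relevance=0.9, age_days=30, half_life_days=2)

# Architecture decisions: a 180-day half-life barely penalizes a
# month-old record, preserving stable knowledge.
old_adr = score(relevance=0.9, age_days=30, half_life_days=180)
```

The same scoring function serves both domains; only the half-life changes, which keeps the retrieval pipeline uniform while letting each memory category age at its own rate.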

Agent Attribution and Trust Calibration

Every memory should record which agent stored it, in what context, and with what confidence. This serves three purposes. First, filtering: an orchestrator agent can search only findings from the monitoring agent when investigating performance issues, ignoring unrelated observations from other agents. Second, trust calibration: not all agents produce equally reliable observations, and the retrieval scoring can weight memories from higher-quality agents more heavily. An agent that has been validated against ground truth at 95% accuracy should have its memories ranked above those from an agent validated at 70% accuracy.

Third, attribution supports debugging. When an agent makes a bad decision based on retrieved memories, you can trace the chain: which memories were retrieved, which agents stored those memories, what was the original context, and what confidence level was assigned. Without attribution, debugging an autonomous multi-agent system is nearly impossible because you cannot determine where incorrect information entered the system.

Garbage Collection and Lifecycle Management

Memories that are never accessed, have low confidence, and are older than a threshold should be automatically removed. Without garbage collection, the memory store grows monotonically and retrieval quality degrades as the signal-to-noise ratio drops. A memory store with 10,000 entries where 8,000 are stale, low-confidence observations performs worse than a curated store of 2,000 high-quality memories because the stale entries compete for retrieval slots and dilute the ranking signal.

Effective garbage collection operates on a tiered schedule. Memories below a confidence threshold (for example, 3.0 on a 10-point scale) that have not been accessed in 30 days are candidates for removal. Memories above a higher threshold (for example, 8.0) are protected regardless of access patterns because they represent well-corroborated knowledge that may be needed for rare but important queries. Between these thresholds, a decay function gradually reduces retrieval priority based on time since last access, allowing the memory to be naturally displaced by more relevant content without hard deletion.
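
The tiered schedule reduces to a small decision function. The thresholds match the example values in the text; the function and constant names are illustrative:

```python
LOW_THRESHOLD = 3.0   # below this and unaccessed for 30 days: remove
HIGH_THRESHOLD = 8.0  # at or above this: protected regardless of access
STALE_DAYS = 30

def gc_decision(confidence, days_since_access):
    """Classify a memory into the three garbage-collection tiers."""
    if confidence >= HIGH_THRESHOLD:
        return "protect"
    if confidence < LOW_THRESHOLD and days_since_access > STALE_DAYS:
        return "remove"
    return "decay"  # gradually reduce retrieval priority, no hard deletion

decisions = [
    gc_decision(confidence=2.0, days_since_access=45),   # stale speculation
    gc_decision(confidence=9.1, days_since_access=200),  # protected fact
    gc_decision(confidence=5.5, days_since_access=10),   # mid-tier, decays
]
```

Only the bottom tier is ever hard-deleted; the middle tier fades out of retrieval results without losing the record, so a late corroborating observation can still revive it.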

Adaptive Recall handles this through its lifecycle fading mechanism. Memories that are not accessed and not reinforced by corroborating evidence gradually decrease in retrieval priority until they fall below the recall threshold. High-confidence memories above 8.0 are protected from fading. Consolidation merges related memories, keeping the knowledge while reducing the entry count. The result is a memory store that stays lean and relevant without manual cleanup.


Give your agents memory that persists, learns, and scales across multi-agent systems. Adaptive Recall handles persistence, conflict resolution, and lifecycle management so your agents can focus on their tasks.