
7 State Persistence Strategies for Agents

Persisting agent state is the difference between an agent that survives interruptions and one that loses hours of work on a restart. The seven strategies below range from simple (local file checkpoints) to sophisticated (durable execution frameworks), each with different durability guarantees, complexity costs, and performance characteristics. Most production systems combine at least two: a fast strategy for within-task checkpointing and a durable strategy for cross-session knowledge.

1. Local File Checkpoints

The simplest persistence strategy: serialize the agent's state to a JSON file on the local filesystem after each significant step. On restart, check for the file and resume from the last saved state.
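
A minimal sketch of the pattern, with an atomic write so a crash mid-save never leaves a truncated file (the checkpoint path and state shape are illustrative):

```python
import json
import os

CHECKPOINT_PATH = "agent_checkpoint.json"  # illustrative location

def save_checkpoint(state: dict) -> None:
    # Write to a temp file, then rename: os.replace is atomic, so a
    # crash mid-save never leaves a half-written checkpoint behind.
    tmp_path = CHECKPOINT_PATH + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, CHECKPOINT_PATH)

def load_checkpoint() -> dict | None:
    # On restart, resume from the last saved state if one exists.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return None
```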

Durability: Survives process crashes but not machine failures. If the disk or container is destroyed, the checkpoint is lost.

Performance: Fast, typically under 1ms for a 10KB JSON write. No network overhead.

Best for: Development, testing, and single-machine agents where the execution environment is stable. Not suitable for containerized deployments where the filesystem is ephemeral.

2. Database Checkpoints

Store checkpoints in a database (PostgreSQL, DynamoDB, Redis with persistence) that is separate from the agent's execution environment. The checkpoint survives machine failures because the database is independently managed and replicated.
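
A sketch of the upsert pattern, assuming psycopg 3 against PostgreSQL and an existing checkpoints table with a unique task_id column (both assumptions):

```python
import psycopg  # psycopg 3
from psycopg.types.json import Jsonb

def save_checkpoint(conn: psycopg.Connection, task_id: str, state: dict) -> None:
    # Upsert one row per task: the latest state replaces the previous
    # one, so recovery only ever reads a single row.
    conn.execute(
        """
        INSERT INTO checkpoints (task_id, state, updated_at)
        VALUES (%s, %s, now())
        ON CONFLICT (task_id) DO UPDATE
        SET state = EXCLUDED.state, updated_at = EXCLUDED.updated_at
        """,
        (task_id, Jsonb(state)),
    )
    conn.commit()
```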

Durability: Survives machine failures. Durability depends on the database configuration: PostgreSQL with synchronous commits and replication is highly durable; Redis with AOF is durable to the last fsync.

Performance: 5 to 50ms per checkpoint write depending on the database, network latency, and commit durability settings (for example, synchronous versus asynchronous commits). Acceptable for step-level checkpointing where steps take seconds to minutes.

Best for: Production agents running in containers, serverless, or distributed environments. This is the most common strategy for production agent systems because it balances durability with implementation simplicity.

3. Event Sourcing

Instead of saving the current state, record every state-changing event (tool calls, decisions, discoveries) as an append-only log. To recover, replay all events from the log to reconstruct the current state. This gives you a complete history of how the agent reached its current state, not just the state itself.
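
A file-backed sketch of the append/replay shape; as noted below, the event store in production is typically a database or message queue, but the pattern is the same:

```python
import json

class EventLog:
    """Append-only event log; state is derived by replay, never stored."""

    def __init__(self, path: str):
        self.path = path

    def append(self, event: dict) -> None:
        # One JSON event per line; appends are cheap and existing
        # entries are never rewritten.
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def replay(self, apply, initial_state: dict) -> dict:
        # Reconstruct the current state by folding every recorded
        # event through a reducer function.
        state = initial_state
        with open(self.path) as f:
            for line in f:
                state = apply(state, json.loads(line))
        return state
```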

Durability: Same as the event store (typically a database or message queue). The complete history is preserved, so you can reconstruct the state at any point in time.

Performance: Writes are fast (append-only). Recovery is slower because it requires replaying all events. For agents with hundreds of events, recovery can take seconds. For agents with thousands of events, consider periodic snapshots that compress the event history.

Best for: Agents where auditability matters (compliance, debugging, accountability). The event log tells you not just what the agent knows but every step of how it got there.

4. LangGraph Checkpointers

LangGraph provides built-in state persistence through its checkpointer interface. After each graph node executes, the checkpointer serializes the graph state to a configurable backend (SQLite, PostgreSQL). On restart, the graph resumes from the last completed node.
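
A sketch of the wiring, assuming the langgraph-checkpoint-sqlite package (import paths and whether from_conn_string is a context manager vary by version); build_graph() is a placeholder for your own graph definition:

```python
from langgraph.checkpoint.sqlite import SqliteSaver

# build_graph() stands in for your StateGraph definition.
graph = build_graph()

with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    app = graph.compile(checkpointer=checkpointer)
    # State is checkpointed after each node, keyed by thread_id;
    # re-invoking with the same thread_id resumes from the last
    # completed node instead of starting over.
    config = {"configurable": {"thread_id": "task-42"}}
    result = app.invoke({"messages": []}, config)
```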

Durability: Depends on the backend. SqliteSaver provides local file durability. PostgresSaver provides distributed durability. Both integrate transparently with the LangGraph execution model.

Performance: The checkpointer adds a write after each node. For nodes that take seconds, the overhead is negligible. For very fast nodes (sub-100ms), the checkpointing overhead may be noticeable.

Best for: Agents built on LangGraph. The integration is nearly zero-code: add three lines to configure the checkpointer and the framework handles the rest.

5. Memory API as State Store

Use a memory API (like Adaptive Recall) as both the long-term knowledge store and the task state store. At each checkpoint, store the current task state as a tagged memory. On recovery, recall memories tagged with the current task ID to reconstruct the state.
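
A hypothetical sketch of the tagging pattern; Adaptive Recall's actual client, method names, and parameters may differ, so treat everything here as illustrative only:

```python
# Hypothetical client: the import, constructor, and method names
# below are assumptions, not Adaptive Recall's documented SDK.
from adaptive_recall import Client

client = Client(api_key="...")

def checkpoint(task_id: str, state: dict) -> None:
    # Store the task state as a tagged memory so recovery (and any
    # other agent) can find it through the normal recall mechanism.
    client.store(content=state, tags=["task-state", task_id])

def recover(task_id: str):
    # Recall the most recent state memory for this task, if any.
    memories = client.recall(tags=["task-state", task_id], limit=1)
    return memories[0].content if memories else None
```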

Durability: Same as the memory API's persistence guarantees. Adaptive Recall provides durable storage with replication.

Performance: Checkpoint writes go through the memory API, which is typically 50 to 200ms. This is slower than direct database writes but has the advantage of making the checkpoint searchable and available to other agents through the normal recall mechanism.

Best for: Agents that already use a memory API for knowledge management. The task state becomes part of the memory graph, visible to other agents and benefiting from the same retrieval, scoring, and lifecycle management as regular memories.

6. Durable Execution (Temporal, Inngest)

Durable execution frameworks like Temporal and Inngest manage persistence at the workflow level. The framework records the input and output of each workflow step. If the worker process dies, the framework restarts it and replays completed steps (using cached outputs) to reach the current point. The agent developer writes normal sequential code and the framework handles durability transparently.
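
A minimal sketch using the Temporal Python SDK; the activity body is a placeholder for a real LLM or tool call:

```python
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def run_agent_step(prompt: str) -> str:
    # Placeholder for one agent step (an LLM call or tool call).
    # Temporal records the result, so it is not re-executed on replay.
    return f"result for: {prompt}"

@workflow.defn
class AgentTask:
    @workflow.run
    async def run(self, goal: str) -> str:
        # Ordinary sequential code; if the worker dies mid-run,
        # Temporal restarts it and replays completed steps from
        # their cached outputs to reach this point.
        plan = await workflow.execute_activity(
            run_agent_step, goal, start_to_close_timeout=timedelta(minutes=5)
        )
        return await workflow.execute_activity(
            run_agent_step, plan, start_to_close_timeout=timedelta(minutes=5)
        )
```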

Durability: Very high. The framework's persistence layer (typically a database cluster) is designed for exactly this purpose. Steps are guaranteed to complete at least once.

Performance: Each step adds a persistence round-trip (10 to 50ms). The replay on recovery is fast because outputs are cached. The main cost is adapting the agent's execution model to the framework's step-based model.

Best for: Long-running agent workflows (hours to days) where reliability is critical. Temporal in particular is battle-tested for exactly this use case in non-AI contexts and translates well to agent orchestration.

7. Conversation State Snapshots

Periodically compress and save the agent's conversation history to an external store. On recovery, load the saved conversation and continue from where it left off. This is the highest-fidelity approach because it preserves the full context, but it is also the most brittle: LLM responses are non-deterministic, so the resumed conversation may reason differently than the original run would have.
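
A minimal sketch of the snapshot and restore path, assuming the conversation is a list of message dicts:

```python
import gzip
import json

def snapshot_conversation(messages: list[dict], path: str) -> None:
    # Compress the full message history before writing; snapshots
    # can reach hundreds of KB for long conversations.
    with gzip.open(path, "wt") as f:
        json.dump(messages, f)

def restore_conversation(path: str) -> list[dict]:
    # Reload the saved history so the agent can continue from it.
    with gzip.open(path, "rt") as f:
        return json.load(f)
```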

Durability: Depends on the storage backend. The conversation snapshot can be stored in a database, file system, or memory API.

Performance: Snapshots are large (tens to hundreds of KB) and slow to write. Recovery requires loading the full conversation into the context window, which may be slow for long conversations.

Best for: Short agent tasks (under 10 minutes) where the conversation context is small enough to save and reload without degradation. Not recommended for long-running tasks where the conversation exceeds the context window.

Combining Strategies

Most production agent systems use two strategies together. A fast, lightweight strategy (local file or database checkpoints) handles within-task state so the agent can resume from the last step after a crash. A durable, semantic strategy (memory API) handles cross-task knowledge so the agent benefits from what it learned in previous tasks. The checkpoint is ephemeral: it exists during the task and is deleted when the task completes. The memory is permanent: key findings from the task persist indefinitely.
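
A sketch of how the two layers fit together across one task lifecycle; every helper name here is a hypothetical stand-in for the strategies above:

```python
# All helpers (load_db_checkpoint, save_db_checkpoint, store_memory,
# delete_db_checkpoint) are hypothetical stand-ins for the checkpoint
# and memory layers sketched in earlier sections.
def run_task(task_id: str, steps: list) -> None:
    # Fast layer: resume from the last database checkpoint, if any.
    state = load_db_checkpoint(task_id) or {"completed": [], "findings": []}
    for step in steps:
        if step.name in state["completed"]:
            continue  # finished before the crash; skip on resume
        state["findings"].append(step.run(state))
        state["completed"].append(step.name)
        save_db_checkpoint(task_id, state)

    # Durable layer: promote key findings to cross-task memory,
    # then delete the now-obsolete ephemeral checkpoint.
    store_memory(content=state["findings"], tags=["finding", task_id])
    delete_db_checkpoint(task_id)
```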

Adaptive Recall fits naturally as the durable knowledge layer in this combination. The agent uses database checkpoints for step-level recovery and Adaptive Recall for knowledge that should persist across tasks. When the task completes, the agent stores its key findings in Adaptive Recall and deletes the checkpoint. Future tasks retrieve those findings through normal recall, benefiting from cognitive scoring, entity connections, and lifecycle management.

Add the durable knowledge layer to your agent persistence strategy. Adaptive Recall provides the cross-task memory that complements your within-task checkpointing.

Get Started Free