
How to Verify Your AI Remembers Customers

Verifying that your AI actually remembers customers requires testing three things: that memories are stored correctly after conversations, that relevant memories are retrieved at the start of new conversations, and that the AI uses retrieved memories to change its behavior rather than ignoring them. A memory system that stores data but never surfaces it, or that surfaces memories the AI ignores, provides no value despite appearing functional.

Before You Start

You need a test customer profile with at least five stored memories covering different topics, time periods, and interaction types. If your system is already in production, create a dedicated test customer rather than testing with real customer data. You also need a way to start fresh conversations as the same customer, simulating the multi-session experience that memory is designed to support.

Step-by-Step Verification

Step 1: Create a test customer with known history.
Store a specific set of memories for a test customer that covers the major categories your system tracks. Include recent and old memories, resolved and unresolved issues, preferences, and factual account information. Document exactly what you stored so you can verify what the AI should recall.
# Create test customer memories with known content
test_memories = [
    {
        "text": "Customer uses Python 3.11 with FastAPI for their "
                "backend. They run on AWS ECS with PostgreSQL.",
        "metadata": {"customer_id": "test-001", "type": "semantic",
                     "topic": "tech_stack", "age_days": 30}
    },
    {
        "text": "Customer had a rate limiting issue last week. "
                "Resolved by upgrading to the Professional plan.",
        "metadata": {"customer_id": "test-001", "type": "episodic",
                     "topic": "rate_limiting", "age_days": 7}
    },
    {
        "text": "Customer prefers concise, technical responses. "
                "They explicitly asked us not to be verbose.",
        "metadata": {"customer_id": "test-001", "type": "preference",
                     "topic": "communication_style", "age_days": 14}
    },
    {
        "text": "Customer is evaluating our enterprise plan for Q3. "
                "Decision depends on SSO support.",
        "metadata": {"customer_id": "test-001", "type": "semantic",
                     "topic": "upgrade_evaluation", "age_days": 3}
    }
]

for memory in test_memories:
    memory_api.store(memory)

The specific details matter for verification. You need to be able to ask questions like "Does the AI know what language I use?" and have a clear expected answer (Python 3.11 with FastAPI). Vague memories produce vague verification results.

Step 2: Run multi-session recall tests.
Start a new conversation as the test customer and ask questions that should trigger memory recall. Do not ask "What do you remember about me?" because that is an artificial prompt no real customer would use. Instead, ask natural questions that should be informed by memory. The test passes if the AI's response demonstrates awareness of stored context without being asked to recall it.
# Test cases for memory recall
test_cases = [
    {
        "message": "I am having trouble with my API integration.",
        "expected_context": ["Python", "FastAPI", "AWS ECS"],
        "pass_if": "Response references their known tech stack "
                   "instead of asking what language they use"
    },
    {
        "message": "I am hitting rate limits again.",
        "expected_context": ["rate_limiting", "Professional plan"],
        "pass_if": "Response acknowledges the previous rate "
                   "limiting issue and their plan upgrade"
    },
    {
        "message": "Can you explain the SSO integration options?",
        "expected_context": ["enterprise plan", "Q3 evaluation"],
        "pass_if": "Response connects SSO question to their "
                   "known enterprise plan evaluation"
    }
]

Run each test case in a separate conversation session to verify that memory persists across sessions. If you run all tests in the same session, you are testing conversation history, not persistent memory. The distinction matters because conversation history is automatic in any LLM system. Persistent memory across sessions is what your implementation adds.
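The session-isolation requirement can be sketched as a small test runner. This is a minimal sketch, not a definitive harness: it assumes a `chat_fn(message, customer_id, session_id)` helper that starts a brand-new conversation whenever it sees an unseen session ID, which is how a fresh session is simulated here.

```python
import uuid

def run_recall_tests(test_cases, chat_fn, customer_id="test-001"):
    """Run each recall test case in its own fresh session.

    chat_fn(message, customer_id, session_id) is an assumed helper
    that begins a new conversation for every unseen session_id, so
    no in-session history can leak between test cases.
    """
    results = []
    for case in test_cases:
        session_id = str(uuid.uuid4())  # new session => no shared history
        response = chat_fn(case["message"],
                           customer_id=customer_id,
                           session_id=session_id)
        # Pass if the reply surfaces any of the expected context terms
        hit = any(term.lower() in response.lower()
                  for term in case["expected_context"])
        results.append({"message": case["message"], "passed": hit})
    return results
```

Because every case gets a fresh `session_id`, any expected context that appears in the response must have come from persistent memory, not from the running conversation.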

Step 3: Test edge cases for memory retrieval.
Verify how the system behaves in scenarios where memory retrieval is imperfect. Test with a brand-new customer who has no memories, and verify the system does not hallucinate context. Test with conflicting memories, where an old memory says one thing and a recent memory says something different, and verify the system prioritizes the recent information. Test with a customer who has hundreds of memories, and verify that retrieval remains fast and returns the most relevant results rather than an arbitrary subset.
# Edge case: No memories exist
def test_new_customer():
    response = chat("I need help with my account",
                    customer_id="new-customer-no-history")
    assert "I see this is your first time" in response \
        or response_asks_for_context(response)

# Edge case: Contradictory memories
def test_conflicting_memories():
    memory_api.store({
        "text": "Customer uses Node.js for their backend.",
        "metadata": {"customer_id": "test-002", "age_days": 90}
    })
    memory_api.store({
        "text": "Customer migrated to Go for their backend.",
        "metadata": {"customer_id": "test-002", "age_days": 5}
    })
    response = chat("Help with my backend setup",
                    customer_id="test-002")
    assert "Go" in response  # Should use recent info

The conflicting memory test is critical. Cognitive scoring should naturally handle this by ranking the recent memory higher, but implementations that use simple retrieval without recency weighting may return the older memory first, causing the AI to reference outdated information. This is one of the most common failure modes in production memory systems.
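One way recency weighting can work is to blend semantic similarity with an exponential decay on memory age. This is an illustrative sketch, not the scoring used by any particular memory system; the half-life and the 0.7/0.3 blend are assumed starting points you would tune against your own data.

```python
def recency_weighted_score(similarity, age_days, half_life_days=30.0):
    """Blend semantic similarity with exponential recency decay.

    half_life_days is an assumed tuning knob: a memory loses half of
    its recency weight every half_life_days. The 0.7/0.3 split between
    similarity and recency is likewise an illustrative assumption.
    """
    recency = 0.5 ** (age_days / half_life_days)
    return 0.7 * similarity + 0.3 * recency

# At equal similarity, the 5-day-old "Go" memory outranks
# the 90-day-old "Node.js" memory
old_score = recency_weighted_score(similarity=0.80, age_days=90)
new_score = recency_weighted_score(similarity=0.80, age_days=5)
```

With simple retrieval that ranks by similarity alone, both memories score identically and the older one can surface first; the recency term is what breaks the tie in favor of current information.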

Step 4: Measure recall accuracy with metrics.
Define quantitative metrics that track memory quality over time. Recall precision measures what percentage of retrieved memories are actually relevant to the current conversation. Recall coverage measures what percentage of memories that should have been retrieved were actually retrieved. Context utilization measures what percentage of retrieved memories the AI actually uses in its response.
# Track memory retrieval quality
def measure_recall_quality(conversation_log):
    retrieved = conversation_log['memories_retrieved']
    used = conversation_log['memories_referenced_in_response']
    relevant = human_annotate_relevance(retrieved,
                                        conversation_log['query'])
    precision = len(relevant) / len(retrieved) if retrieved else 0
    utilization = len(used) / len(retrieved) if retrieved else 0
    return {
        "precision": precision,
        "utilization": utilization,
        "retrieved_count": len(retrieved),
        "used_count": len(used)
    }
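The quality helper above covers precision and utilization; recall coverage additionally needs a labeled set of memories that should have been retrieved for the query. A minimal sketch, assuming memories are identified by ID and the expected set comes from human annotation of the test customer's stored memories:

```python
def measure_recall_coverage(retrieved_ids, expected_ids):
    """Coverage = fraction of should-have-been-retrieved memories
    that actually came back from the memory store.

    expected_ids is an assumed human-labeled ground-truth set for
    the query; retrieved_ids is what the retrieval call returned.
    """
    retrieved = set(retrieved_ids)
    expected = set(expected_ids)
    if not expected:
        return 1.0  # nothing was required, so coverage is trivially met
    return len(retrieved & expected) / len(expected)
```

Low coverage with high precision usually means the retrieval query or result limit is too narrow: what comes back is relevant, but relevant memories are being left behind.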

Aim for precision above 70% (most retrieved memories are relevant) and utilization above 50% (the AI uses at least half of what is retrieved). Low precision means the retrieval query is too broad or the memory store contains too much noise. Low utilization means the AI is ignoring relevant context, which usually indicates a system prompt issue rather than a memory issue.

Step 5: Set up ongoing monitoring.
Memory quality can degrade over time as customers accumulate more memories, as the data distribution shifts, or as system changes affect retrieval behavior. Set up automated monitoring that runs verification tests on a schedule and alerts when metrics drop below thresholds.
# Daily automated memory health check
def daily_memory_check():
    # Verify test customer memories still retrieve correctly
    for test_case in standard_test_cases:
        result = memory_api.recall(
            query=test_case['query'],
            filter={"customer_id": test_case['customer_id']}
        )
        if not meets_quality_threshold(result, test_case):
            alert("Memory recall quality below threshold",
                  test_case=test_case, result=result)

    # Check retrieval latency
    latency = measure_retrieval_latency()
    if latency > 500:  # ms
        alert(f"Memory retrieval latency {latency}ms "
              f"exceeds 500ms threshold")

Monitor retrieval latency alongside accuracy. As the memory store grows, retrieval can slow down if indices are not maintained. A memory system that returns accurate results in 2 seconds is effectively broken for real-time customer conversations where the bot needs to respond within 3 to 5 seconds total.
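Latency is better tracked as percentiles than as a single number, since tail latency is what breaks the 3-to-5-second response budget. A minimal sketch of the latency measurement, assuming a `recall_fn(query)` callable that hits your memory store:

```python
import statistics
import time

def measure_retrieval_latency_ms(recall_fn, queries, runs_per_query=3):
    """Time recall_fn over sample queries and report p50/p95 in ms.

    recall_fn(query) is an assumed callable wrapping your memory
    store's retrieval endpoint; queries should mirror real traffic.
    """
    samples = []
    for query in queries:
        for _ in range(runs_per_query):
            start = time.perf_counter()
            recall_fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {"p50_ms": p50, "p95_ms": p95, "samples": len(samples)}
```

Alert on p95 rather than the average: a store whose median is 80 ms but whose p95 is 2 seconds still fails a meaningful share of real conversations.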

Build customer memory you can trust. Adaptive Recall provides retrieval metrics, health monitoring, and a status tool to verify your memory system is working correctly in production.

Try It Free