How to Build a Two-Stage Retrieval System
Before You Start
You need a vector database or embedding index (Pinecone, pgvector, Qdrant, Weaviate, or similar) and the ability to store metadata alongside your embeddings. The second stage requires metadata fields that are not part of a standard vector store setup, so plan to either extend your vector store schema or maintain a separate metadata store that you join with vector results at query time.
This guide builds a system from scratch. If you already have a single-stage RAG pipeline and want to add reranking to it, see the reranking guide instead.
Step-by-Step Implementation
Configure your vector database to return more results than you plan to show the user. The first stage is a recall-optimized step: its job is to make sure the correct answer is somewhere in the candidate set, even if it is not ranked first. Set the default retrieval count to 30 to 50 candidates. This is fast because vector search with precomputed embeddings scales sub-linearly with index size, and the difference between retrieving 5 and 50 results is typically under 10 milliseconds.
-- pgvector example: retrieve top 40 candidates
-- $1 is the tenant id, $2 is the query embedding vector
SELECT id, content, embedding, confidence, access_times, entities,
       1 - (embedding <=> $2) AS similarity
FROM memories
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 40;

Each candidate record needs the fields that the second stage will score on. Beyond content and embedding, include: an array of access timestamps (for base-level activation), a list of extracted entities (for spreading activation), a confidence score (for reliability weighting), a corroboration count (for confidence context), and the creation timestamp. If your vector store does not support complex metadata, store these fields in a relational database and join by memory ID after the vector query.
{
  "id": "mem_xyz789",
  "content": "The rate limit was raised to 500 RPM on March 15",
  "embedding": [0.012, -0.034, ...],
  "similarity": 0.87,
  "created_at": "2026-03-15T09:00:00Z",
  "access_times": ["2026-03-15T09:00:00Z", "2026-03-20T14:30:00Z",
                   "2026-04-02T11:15:00Z", "2026-05-01T16:45:00Z"],
  "entities": ["rate limit", "API", "RPM", "throttling"],
  "confidence": 8.2,
  "corroboration_count": 3
}

The retriever function takes a query, embeds it, and returns candidates with their similarity scores and metadata. Keep this function simple and fast. It should not do any complex scoring, just vector similarity and metadata fetching. If you need to join metadata from a separate store, use batch lookups rather than per-candidate queries to keep latency low.
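If the metadata lives in a separate relational store, that batch lookup can be a single query keyed by memory ID. The sketch below assumes asyncpg and a hypothetical memory_metadata table; adapt the names to your own schema.

import asyncpg

async def attach_metadata(pool: asyncpg.Pool, hits: list[dict]) -> list[dict]:
    # merge vector-search hits with their metadata rows in one batch query
    ids = [h["id"] for h in hits]
    rows = await pool.fetch(
        """
        SELECT id, access_times, entities, confidence,
               corroboration_count, created_at
        FROM memory_metadata
        WHERE id = ANY($1::text[])
        """,
        ids,
    )
    by_id = {r["id"]: dict(r) for r in rows}
    return [{**h, **by_id.get(h["id"], {})} for h in hits]

The retriever itself stays thin: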
async def retrieve_candidates(query: str, top_k: int = 40):
    query_embedding = await embed(query)
    # vector search returns candidates with similarity scores
    candidates = await vector_store.query(
        embedding=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    return candidates, query_embedding

The scorer takes the candidate list and produces a final score for each candidate by combining vector similarity with cognitive factors. Normalize each scoring dimension to a 0 to 1 range before combining them so that no single dimension dominates due to scale differences. Apply a sigmoid function to base-level activation (which can range from negative infinity to positive values) to map it into the 0 to 1 range.
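The base_level_activation helper that the scorer calls is not defined in this guide. One common formulation is the ACT-R base-level learning equation: the log of a sum of power-decayed access recencies. The decay rate of 0.5 and hour-granularity ages below are illustrative choices, and spreading_activation and extract_entities are likewise left to your entity-graph implementation.

import math
from datetime import datetime, timezone

def base_level_activation(access_times: list[str], decay: float = 0.5) -> float:
    # ACT-R style base-level learning: ln(sum over accesses of age^-decay),
    # with ages measured here in hours since each recorded access
    now = datetime.now(timezone.utc)
    ages = []
    for ts in access_times:
        accessed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        ages.append(max((now - accessed).total_seconds() / 3600, 0.1))
    if not ages:
        return float("-inf")  # never accessed; the sigmoid maps this to 0
    return math.log(sum(age ** -decay for age in ages))

With a helper like that in place, the scorer combines the normalized dimensions: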
import math

def score_candidate(candidate, query_entities, entity_graph):
    # vector similarity (already 0-1)
    sim = candidate['similarity']
    # base-level activation from access history
    bla = base_level_activation(candidate['access_times'])
    bla_norm = 1.0 / (1.0 + math.exp(-bla))  # sigmoid to 0-1
    # spreading activation from entity graph
    sa = spreading_activation(candidate, query_entities, entity_graph)
    sa_norm = min(sa / 5.0, 1.0)  # cap at 1.0
    # confidence (normalize from 0-10 to 0.5-1.0)
    conf = 0.5 + (candidate.get('confidence', 5.0) / 10.0) * 0.5
    # weighted combination, with confidence applied as a multiplier
    score = (0.40 * sim +
             0.30 * bla_norm +
             0.20 * sa_norm)
    score *= conf
    return score

Wire the retriever and scorer together in a pipeline function. This function handles the full flow: embed the query, retrieve candidates, extract query entities for spreading activation, score each candidate, sort by score, and return the top results. Add a minimum score threshold to filter out low-quality candidates before they reach the LLM.
async def two_stage_retrieve(query: str, final_k: int = 5,
                             min_score: float = 0.3):
    candidates, query_embedding = await retrieve_candidates(query, top_k=40)
    query_entities = extract_entities(query)
    entity_graph = await load_entity_graph()
    scored = []
    for candidate in candidates:
        score = score_candidate(candidate, query_entities, entity_graph)
        if score >= min_score:
            scored.append({**candidate, 'final_score': score})
    scored.sort(key=lambda x: x['final_score'], reverse=True)
    return scored[:final_k]

The number of first-stage candidates controls the trade-off between recall and latency. More candidates mean a higher chance of including the correct answer but more work for the second stage. Start with 40 candidates and measure the impact of increasing or decreasing that number, as in the sketch below. If your second stage uses only precomputed metadata (like cognitive scoring), 50 to 100 candidates add negligible latency. If it uses model inference (like cross-encoders), keep the candidate count under 30 to control latency.
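A rough way to measure the latency side of that trade-off, assuming the functions defined above; measuring recall additionally needs a labeled query set, which is not shown here.

import time

async def sweep_candidate_counts(query: str, counts=(20, 40, 60, 100)):
    # entity extraction and graph loading are constant per query,
    # so keep them outside the timed region
    query_entities = extract_entities(query)
    entity_graph = await load_entity_graph()
    for top_k in counts:
        start = time.perf_counter()
        candidates, _ = await retrieve_candidates(query, top_k=top_k)
        for candidate in candidates:
            score_candidate(candidate, query_entities, entity_graph)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"top_k={top_k}: {len(candidates)} candidates scored in {elapsed_ms:.1f} ms")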
Performance Benchmarks
A typical two-stage pipeline with cognitive scoring shows the following latency breakdown: query embedding takes 20 to 50 milliseconds (depends on your embedding model and whether you call an API or run locally), vector search takes 5 to 15 milliseconds for stores under 1 million records, metadata fetching takes 2 to 10 milliseconds depending on storage backend, and cognitive scoring takes 15 to 40 milliseconds including graph traversal. Total end-to-end retrieval latency is typically 50 to 100 milliseconds, well within the latency budget for interactive applications.
For comparison, a single-stage, vector-search-only pipeline runs in 25 to 65 milliseconds. The two-stage approach adds 25 to 50 milliseconds for a meaningful improvement in ranking quality. The added latency is almost entirely in the cognitive scoring step, not in retrieving more candidates from the vector store.
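To reproduce this kind of stage-by-stage breakdown on your own stack, wrap each stage in a timer. A minimal sketch, assuming the functions used earlier in this guide:

import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, timings: dict):
    start = time.perf_counter()
    yield
    timings[label] = (time.perf_counter() - start) * 1000  # milliseconds

async def profile_retrieval(query: str, top_k: int = 40):
    timings = {}
    with timed("embed", timings):
        query_embedding = await embed(query)
    with timed("vector_search", timings):
        candidates = await vector_store.query(
            embedding=query_embedding, top_k=top_k, include_metadata=True)
    with timed("cognitive_scoring", timings):
        query_entities = extract_entities(query)
        entity_graph = await load_entity_graph()
        for candidate in candidates:
            score_candidate(candidate, query_entities, entity_graph)
    return timings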
When to Add a Third Stage
Some applications benefit from a third stage that uses an LLM to evaluate the final candidates. After vector retrieval and cognitive scoring narrow the results to 5 to 10 items, you send each candidate and the query to an LLM with a prompt asking it to rate relevance. This catches subtleties that neither embeddings nor metadata can capture, like whether the candidate actually answers the question rather than just discussing the same topic. The trade-off is significant latency (500 milliseconds to 2 seconds) and cost (LLM tokens for each evaluation). Use this only for high-stakes applications where accuracy justifies the added expense.
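A minimal sketch of that third stage. call_llm is a placeholder for whatever completion client you use, and the 0-to-10 rating prompt is illustrative rather than a fixed recipe.

async def llm_rerank(query: str, candidates: list[dict], keep: int = 3) -> list[dict]:
    rated = []
    for candidate in candidates:
        prompt = (
            "Rate from 0 to 10 how directly the following memory answers the "
            f"question. Reply with a single number.\n\nQuestion: {query}\n\n"
            f"Memory: {candidate['content']}"
        )
        reply = await call_llm(prompt)  # placeholder for your LLM client
        try:
            rating = float(reply.strip())
        except ValueError:
            rating = 0.0  # treat unparseable replies as irrelevant
        rated.append({**candidate, "llm_rating": rating})
    rated.sort(key=lambda x: x["llm_rating"], reverse=True)
    return rated[:keep]

Run this only on the 5 to 10 candidates that survive the first two stages, since each call adds model latency and token cost.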