How Much Latency Does Cognitive Scoring Add?
Detailed Latency Breakdown
Cognitive scoring runs as a post-processing step after the initial vector search returns a candidate set. The vector search itself (whose latency varies with corpus size, index type, and hardware) is the same whether you use cognitive scoring or not. The cognitive scoring overhead is purely the additional computation needed to rerank the candidate set.
Base-Level Activation: 1-3ms
With precomputed activation values (the recommended approach), this phase is a database lookup for each candidate memory. For a candidate set of 30 items, that means 30 key-value lookups. With in-memory caching or an indexed database, each lookup takes microseconds. The total time is dominated by the number of candidates rather than the complexity of the calculation.
Without precomputation (computing activation from the raw access history at query time), this phase takes 5 to 15 milliseconds because each candidate requires iterating through its access history and computing the power-law sum. Precomputation cuts this roughly fivefold, down to the 1 to 3 millisecond range above.
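To make the two paths concrete, here is a minimal sketch assuming an ACT-R-style power-law decay over access timestamps. The function names, the decay constant, and the precomputed-store shape are illustrative assumptions, not a real API:

```python
import math
import time

DECAY = 0.5  # illustrative decay exponent; the real value would be tunable

def base_level_activation(access_times: list[float], now: float | None = None) -> float:
    """Slow path: compute activation from raw access history at query time.

    Each past access contributes t**(-DECAY), where t is seconds since that
    access, so recent and frequent memories score higher. Iterating the full
    history per candidate is what costs 5-15ms across a candidate set.
    """
    now = now if now is not None else time.time()
    total = sum((now - t) ** -DECAY for t in access_times if now > t)
    return math.log(total) if total > 0 else float("-inf")

# Fast path: with precomputed values (refreshed offline), scoring a
# candidate collapses to a single key-value lookup.
precomputed: dict[str, float] = {}  # memory_id -> activation

def lookup_activation(memory_id: str) -> float:
    return precomputed.get(memory_id, float("-inf"))
```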
Spreading Activation: 5-20ms
This is the most variable part of the scoring overhead because it depends on the size and connectivity of the entity graph. The process involves extracting entities from the query (fast string matching against the known entity index), looking up connected memories for each query entity (hash table lookups), optionally expanding to depth-2 neighbors, and computing the weighted sum of activation contributions for each candidate.
With an in-memory entity graph cache, each step is a hash table lookup costing microseconds. The total time depends on how many query entities are found (typically 2 to 5 for a natural language query) and how many depth-2 neighbors each entity has. For typical configurations, the total is 5 to 15 milliseconds. Highly connected graphs (entities with hundreds of neighbors) push this toward 20 milliseconds.
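A sketch of that process, assuming the entity graph is held as two in-memory hash maps (entity to connected memories, entity to neighboring entities). The names and the depth-2 discount are hypothetical:

```python
from collections import defaultdict

# Hypothetical in-memory entity graph cache.
entity_memories: dict[str, set[str]] = defaultdict(set)   # entity -> memory ids
entity_neighbors: dict[str, set[str]] = defaultdict(set)  # entity -> entities

def spreading_activation(query_entities: list[str],
                         candidates: set[str],
                         depth2: bool = True,
                         depth2_weight: float = 0.5) -> dict[str, float]:
    """Accumulate weighted activation contributions for each candidate.

    Depth-1: memories directly linked to a query entity get full weight.
    Depth-2: memories linked via a neighbor entity get a discounted weight.
    Every step is a hash lookup, so cost scales with graph connectivity,
    which is why highly connected graphs push toward the 20ms end.
    """
    scores: dict[str, float] = {m: 0.0 for m in candidates}
    for entity in query_entities:
        for mem in entity_memories[entity] & candidates:
            scores[mem] += 1.0
        if depth2:
            for neighbor in entity_neighbors[entity]:
                for mem in entity_memories[neighbor] & candidates:
                    scores[mem] += depth2_weight
    return scores
```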
Confidence and Score Blending: <1ms
Confidence is a stored value (a single float per memory), so looking it up is trivial. Score blending involves a weighted sum of four floats per candidate, followed by a sort of the candidate set. For 30 candidates, this is computational noise, well under a millisecond.
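The blending step is small enough to show in full. A minimal sketch; the four-way weight split is an illustrative assumption, not the actual defaults:

```python
def blend_scores(candidates, weights=(0.5, 0.2, 0.2, 0.1)):
    """Weighted sum of four floats per candidate, then a sort.

    `candidates` is a list of (memory_id, similarity, activation,
    spreading, confidence) tuples. For ~30 candidates this whole
    function runs well under a millisecond.
    """
    w_sim, w_act, w_spread, w_conf = weights
    scored = [
        (w_sim * sim + w_act * act + w_spread * spread + w_conf * conf, mem_id)
        for mem_id, sim, act, spread, conf in candidates
    ]
    scored.sort(reverse=True)  # highest blended score first
    return [mem_id for _, mem_id in scored]
```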
Total Overhead in Context
The individual phases sum to roughly 7 to 24 milliseconds with precomputed activation, and up to about 35 milliseconds without it. Taking 15 to 40 milliseconds as a conservative envelope, the total cognitive scoring overhead needs to be evaluated in the context of the full retrieval pipeline:
| Component | Typical Latency | Share of Total (midpoints) |
|---|---|---|
| Network round trip to API | 20-100ms | ~3% |
| Vector search (embedding + similarity) | 10-50ms | ~2% |
| Cognitive scoring (reranking) | 15-40ms | ~1.5% |
| LLM inference (using retrieved context) | 500-3000ms | ~94% |
LLM inference dominates the total response time by a wide margin. The cognitive scoring overhead is a small fraction of the non-LLM components and an even smaller fraction of the total. In practice, users cannot perceive the difference between a response that took 1200 milliseconds and one that took 1230 milliseconds.
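The shares in the table come from simple midpoint arithmetic over the stated ranges, which is easy to reproduce:

```python
# Midpoints of the latency ranges in the table above.
components = {
    "network round trip": (20 + 100) / 2,    # 60 ms
    "vector search":      (10 + 50) / 2,     # 30 ms
    "cognitive scoring":  (15 + 40) / 2,     # 27.5 ms
    "LLM inference":      (500 + 3000) / 2,  # 1750 ms
}
total = sum(components.values())  # 1867.5 ms
for name, ms in components.items():
    print(f"{name}: {ms:.1f} ms ({100 * ms / total:.1f}%)")
# LLM inference alone accounts for roughly 94% of the midpoint total.
```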
When Latency Matters
There are a few scenarios where even 15 to 40 milliseconds matters:
- High-frequency batch retrieval: If you are running thousands of retrievals per second (batch processing, bulk analysis), the cumulative overhead becomes significant. In this case, disable spreading activation and use only precomputed base-level activation for a sub-5ms overhead.
- Latency-critical user-facing search: If retrieval is the entire operation (no LLM inference afterward) and the user expects sub-100ms results, like a search-as-you-type feature, every millisecond counts. Use the fast scoring mode (similarity + precomputed activation only) for these paths.
- Edge deployments: On resource-constrained devices, the entity graph cache may not fit in memory, requiring database lookups that add latency. Consider reducing the graph to the most frequently accessed entities for edge deployments.
For the vast majority of AI memory applications, where retrieval feeds context into an LLM call, the cognitive scoring overhead is negligible and the retrieval quality improvement is significant.
Optimization Levers
If latency is a concern, you have several options for reducing cognitive scoring overhead without removing it entirely:
- Reduce the candidate set size from the default (fewer candidates to score).
- Disable depth-2 spreading activation (saves 5 to 10ms in most configurations).
- Use precomputed activation values (reduces the activation phase from as much as 15ms to under 3ms).
- Cache the entity graph in application memory (eliminates database lookups for spreading activation).
- Skip spreading activation entirely for ultra-low latency (under 5ms total scoring overhead).
Each lever trades some retrieval quality for reduced latency. The default configuration maximizes retrieval quality. The fastest configuration (precomputed activation only, no spreading activation) adds under 5 milliseconds but misses the contextual connections that spreading activation provides.
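As a rough illustration, the levers above could map onto a configuration object like the following. The class, field names, and defaults are hypothetical, intended only to show how the quality/latency trade-off becomes a handful of switches:

```python
from dataclasses import dataclass

@dataclass
class ScoringConfig:
    candidate_limit: int = 30                # fewer candidates -> less work per query
    use_precomputed_activation: bool = True  # avoids the 5-15ms power-law path
    enable_spreading: bool = True            # the main contextual-quality lever
    spreading_depth: int = 2                 # drop to 1 to save ~5-10ms
    cache_entity_graph: bool = True          # in-memory graph, no DB round trips

# Default: maximize retrieval quality.
DEFAULT = ScoringConfig()

# Fastest: precomputed activation only, under ~5ms of scoring overhead,
# at the cost of the contextual connections spreading activation provides.
FAST = ScoringConfig(candidate_limit=15, enable_spreading=False)
```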
Production-grade cognitive scoring with sub-50ms overhead. Adaptive Recall optimizes the full pipeline so you get the best retrieval quality at minimal latency cost.
Try It Free