# How Much Does Adding a Reranker Slow Down Search?
## Latency Breakdown by Method
| Method | Latency (20 candidates) | Latency (50 candidates) | Hardware |
|---|---|---|---|
| Cognitive scoring | 15-25ms | 25-40ms | CPU only |
| MiniLM cross-encoder | 15-30ms | 30-70ms | GPU |
| BGE-reranker-v2 | 80-150ms | 180-350ms | GPU |
| Cohere Rerank API | 100-200ms | 150-300ms | Hosted |
| LLM-as-a-judge | 500-1500ms | 1000-3000ms | API call |
## Where the Time Goes
Cognitive scoring spends most of its time on entity graph traversal for spreading activation (5 to 20ms). Base-level activation calculation, confidence weighting, and score combination are all sub-millisecond operations on precomputed values. If you disable spreading activation and use only base-level activation and confidence, the overhead drops to under 5ms.
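To make that concrete, here is a minimal sketch of how precomputed signals might be combined into a single score. The field names, weights, and decay exponent are hypothetical placeholders for illustration, not Adaptive Recall's actual implementation.

```python
import math
import time

def cognitive_score(candidate, activated_entities, now=None):
    """Hypothetical sketch: combine precomputed signals into one score.

    `candidate` is assumed to carry a vector-similarity score, prior access
    timestamps, a confidence value, and a set of entity IDs; the weights and
    the 0.5 decay exponent are illustrative defaults.
    """
    now = now or time.time()

    # Base-level activation: power-law decay over prior accesses (ACT-R style).
    base = math.log(sum(max(now - t, 1.0) ** -0.5
                        for t in candidate["access_times"]) + 1e-9)

    # Spreading activation: overlap between the query's activated entity graph
    # and the candidate's entities. Building `activated_entities` is the graph
    # traversal that accounts for most of the 5-20ms.
    spread = sum(activated_entities.get(e, 0.0) for e in candidate["entities"])

    # Weighted combination with the original retrieval score and confidence.
    return (0.60 * candidate["similarity"]
            + 0.20 * base
            + 0.15 * spread
            + 0.05 * candidate["confidence"])

def rerank(candidates, activated_entities):
    # Per-candidate work is simple arithmetic on precomputed values.
    return sorted(candidates,
                  key=lambda c: cognitive_score(c, activated_entities),
                  reverse=True)
```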
Cross-encoder models spend their time on transformer inference: tokenizing the query-document pairs, running them through the model's attention layers, and extracting the relevance score. Larger models (BGE-reranker-v2 at 560M parameters) take longer than smaller models (MiniLM at 22M parameters). GPU inference is 5 to 10 times faster than CPU inference for these models. Batch processing (scoring all candidates in one pass) is critical because it amortizes GPU kernel launch overhead.
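As a point of reference, a batched cross-encoder pass with the sentence-transformers library looks roughly like the sketch below. The MiniLM checkpoint name is one of the commonly published cross-encoders, and the batch size and top-k are illustrative choices.

```python
from sentence_transformers import CrossEncoder

# Load a small cross-encoder (~22M parameters); device="cuda" assumes a GPU.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    # Score all query-document pairs in one batched call so GPU kernel
    # launch overhead is amortized across the whole candidate set.
    pairs = [(query, doc) for doc in candidates]
    scores = model.predict(pairs, batch_size=32)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```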
LLM-as-a-judge spends time on the full LLM inference cycle: API call overhead (network round trip), token generation for the relevance evaluation, and response parsing. Even with parallel evaluation of all candidates, the minimum latency is bounded by a single LLM call (typically 300 to 800ms), and the maximum grows with candidate count because API rate limits may force sequential processing.
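A sketch of parallel LLM-as-a-judge scoring, assuming the OpenAI Python client: the model name, prompt, and concurrency limit are illustrative, and real rate limits may force lower concurrency or retries.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()          # assumes OPENAI_API_KEY is set in the environment
semaphore = asyncio.Semaphore(10)  # cap concurrency to stay under rate limits

async def judge(query: str, doc: str) -> float:
    """Ask the LLM for a 0-10 relevance grade; prompt and model are illustrative."""
    async with semaphore:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": ("Rate the relevance of this document to the query "
                            "on a 0-10 scale. Reply with only the number.\n\n"
                            f"Query: {query}\n\nDocument: {doc}"),
            }],
            max_tokens=4,
            temperature=0,
        )
        try:
            return float(resp.choices[0].message.content.strip())
        except ValueError:
            return 0.0  # treat an unparseable reply as irrelevant

async def rerank_with_llm(query: str, candidates: list[str]) -> list[str]:
    # Evaluate all candidates concurrently; latency is bounded below by one call.
    scores = await asyncio.gather(*(judge(query, d) for d in candidates))
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]
```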
## Optimizing Reranker Latency
The most effective optimization is reducing the candidate count. Reranking 20 candidates instead of 50 roughly halves the latency for model-based methods, because inference cost scales with the number of query-document pairs scored, whether they run in one batch or several. For cognitive scoring, the reduction is less dramatic because the graph lookup is the bottleneck, not per-candidate computation.
Other optimizations include quantizing cross-encoder models to INT8 (2x speedup, minimal accuracy loss), using ONNX Runtime instead of PyTorch (30 to 50 percent speedup from graph optimization), batching candidates into a single inference call (5 to 10x faster than sequential processing), and caching entity graph lookups for frequently queried entities.
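For the quantization and batching points, a rough sketch using PyTorch dynamic INT8 quantization (a CPU-side path) on the same MiniLM cross-encoder is shown below; actual speedups and accuracy depend on the model, hardware, and runtime version.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Dynamic INT8 quantization: replace the Linear layers with quantized versions.
# This path targets CPU inference; the ~2x figure in the text is a rough guide.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def score(query: str, docs: list[str], batch_size: int = 32) -> list[float]:
    scores = []
    with torch.no_grad():
        for i in range(0, len(docs), batch_size):
            batch = docs[i:i + batch_size]
            # Tokenize query-document pairs together and score them as one batch.
            enc = tokenizer([query] * len(batch), batch, padding=True,
                            truncation=True, return_tensors="pt")
            logits = qmodel(**enc).logits.squeeze(-1)
            scores.extend(logits.tolist())
    return scores
```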
## Total Pipeline Context
Reranker latency should be evaluated in the context of the total pipeline, not in isolation. A typical RAG pipeline takes 600ms to 3 seconds end-to-end, with LLM generation consuming 80 to 90 percent of that time. Adding 30ms of cognitive scoring increases total latency by 1 to 5 percent. Even adding 150ms of cross-encoder reranking only increases total latency by 5 to 25 percent.
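The arithmetic behind those percentages, assuming the reranker runs strictly in series with the rest of the pipeline:

```python
def overhead_pct(rerank_ms: float, pipeline_ms: float) -> float:
    """Reranking time as a percentage of the existing end-to-end latency."""
    return 100 * rerank_ms / pipeline_ms

# 30ms of cognitive scoring against a 600ms-3s pipeline: 1-5% added latency.
print(overhead_pct(30, 3000), overhead_pct(30, 600))    # 1.0 5.0
# 150ms of cross-encoder reranking: 5-25% added latency.
print(overhead_pct(150, 3000), overhead_pct(150, 600))  # 5.0 25.0
```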
The latency that matters is perceived latency: how long the user waits for a response. In streaming applications, where the LLM starts emitting tokens before the full response is complete, the first token typically appears within 200 to 500ms of the LLM call starting, so an extra 30 to 150ms of reranking before that call is barely perceptible. In non-streaming applications, the reranking time is folded into the total wait, which is dominated by generation time.
The fastest multi-factor reranking available: Adaptive Recall's cognitive scoring adds under 40ms per query with no model inference.