# How Much Does Adding a Reranker Slow Down Search?
## Latency Breakdown by Method
| Method | Latency (20 candidates) | Latency (50 candidates) | Hardware |
|---|---|---|---|
| Cognitive scoring | 15-25ms | 25-40ms | CPU only |
| MiniLM cross-encoder | 15-30ms | 30-70ms | GPU |
| BGE-reranker-v2 | 80-150ms | 180-350ms | GPU |
| Cohere Rerank API | 100-200ms | 150-300ms | Hosted |
| LLM-as-a-judge | 500-1500ms | 1000-3000ms | API call |
## Where the Time Goes
Cognitive scoring spends most of its time on entity graph traversal for spreading activation (5 to 20ms). Base-level activation calculation, confidence weighting, and score combination are all sub-millisecond operations on precomputed values. If you disable spreading activation and use only base-level activation and confidence, the overhead drops to under 5ms.
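To make that concrete, here is a minimal sketch of how precomputed signals might be combined into a single score. The field names, weights, and decay exponent are hypothetical placeholders for illustration, not Adaptive Recall's actual implementation.

```python
import math
import time

def cognitive_score(candidate, activated_entities, now=None):
    """Hypothetical sketch: combine precomputed signals into one score.

    `candidate` is assumed to carry a vector-similarity score, prior access
    timestamps, a confidence value, and a set of entity IDs; the weights and
    the 0.5 decay exponent are illustrative defaults.
    """
    now = now or time.time()

    # Base-level activation: power-law decay over prior accesses (ACT-R style).
    base = math.log(sum(max(now - t, 1.0) ** -0.5
                        for t in candidate["access_times"]) + 1e-9)

    # Spreading activation: overlap between the query's activated entity graph
    # and the candidate's entities. Building `activated_entities` is the graph
    # traversal that accounts for most of the 5-20ms.
    spread = sum(activated_entities.get(e, 0.0) for e in candidate["entities"])

    # Weighted combination with the original retrieval score and confidence.
    return (0.60 * candidate["similarity"]
            + 0.20 * base
            + 0.15 * spread
            + 0.05 * candidate["confidence"])

def rerank(candidates, activated_entities):
    # Per-candidate work is simple arithmetic on precomputed values.
    return sorted(candidates,
                  key=lambda c: cognitive_score(c, activated_entities),
                  reverse=True)
```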
Cross-encoder models spend their time on transformer inference: tokenizing the query-document pairs, running them through the model's attention layers, and extracting the relevance score. Larger models (BGE-reranker-v2 at 560M parameters) take longer than smaller models (MiniLM at 22M parameters). GPU inference is 5 to 10 times faster than CPU inference for these models. Batch processing (scoring all candidates in one pass) is critical because it amortizes GPU kernel launch overhead.
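As a point of reference, a batched cross-encoder pass with the sentence-transformers library looks roughly like the sketch below. The MiniLM checkpoint name is one of the commonly published cross-encoders, and the batch size and top-k are illustrative choices.

```python
from sentence_transformers import CrossEncoder

# Load a small cross-encoder (~22M parameters); device="cuda" assumes a GPU.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    # Score all query-document pairs in one batched call so GPU kernel
    # launch overhead is amortized across the whole candidate set.
    pairs = [(query, doc) for doc in candidates]
    scores = model.predict(pairs, batch_size=32)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```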
LLM-as-a-judge spends time on the full LLM inference cycle: API call overhead (network round trip), token generation for the relevance evaluation, and response parsing. Even with parallel evaluation of all candidates, the minimum latency is bounded by a single LLM call (typically 300 to 800ms), and the maximum grows with candidate count because API rate limits may force sequential processing.
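A sketch of parallel LLM-as-a-judge scoring, assuming the OpenAI Python client: the model name, prompt, and concurrency limit are illustrative, and real rate limits may force lower concurrency or retries.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()          # assumes OPENAI_API_KEY is set in the environment
semaphore = asyncio.Semaphore(10)  # cap concurrency to stay under rate limits

async def judge(query: str, doc: str) -> float:
    """Ask the LLM for a 0-10 relevance grade; prompt and model are illustrative."""
    async with semaphore:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": ("Rate the relevance of this document to the query "
                            "on a 0-10 scale. Reply with only the number.\n\n"
                            f"Query: {query}\n\nDocument: {doc}"),
            }],
            max_tokens=4,
            temperature=0,
        )
        try:
            return float(resp.choices[0].message.content.strip())
        except ValueError:
            return 0.0  # treat an unparseable reply as irrelevant

async def rerank_with_llm(query: str, candidates: list[str]) -> list[str]:
    # Evaluate all candidates concurrently; latency is bounded below by one call.
    scores = await asyncio.gather(*(judge(query, d) for d in candidates))
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]
```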
## Optimizing Reranker Latency
The most effective optimization is reducing the candidate count. Reranking 20 candidates instead of 50 roughly halves the latency for model-based methods, because inference cost scales with the number of query-document pairs scored, whether they run in one batch or several. For cognitive scoring, the reduction is less dramatic because the graph lookup is the bottleneck, not per-candidate computation.
Other optimizations include quantizing cross-encoder models to INT8 (2x speedup, minimal accuracy loss), using ONNX Runtime instead of PyTorch (30 to 50 percent speedup from graph optimization), batching candidates into a single inference call (5 to 10x faster than sequential processing), and caching entity graph lookups for frequently queried entities.
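For the quantization and batching points, a rough sketch using PyTorch dynamic INT8 quantization (a CPU-side path) on the same MiniLM cross-encoder is shown below; actual speedups and accuracy depend on the model, hardware, and runtime version.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Dynamic INT8 quantization: replace the Linear layers with quantized versions.
# This path targets CPU inference; the ~2x figure in the text is a rough guide.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def score(query: str, docs: list[str], batch_size: int = 32) -> list[float]:
    scores = []
    with torch.no_grad():
        for i in range(0, len(docs), batch_size):
            batch = docs[i:i + batch_size]
            # Tokenize query-document pairs together and score them as one batch.
            enc = tokenizer([query] * len(batch), batch, padding=True,
                            truncation=True, return_tensors="pt")
            logits = qmodel(**enc).logits.squeeze(-1)
            scores.extend(logits.tolist())
    return scores
```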
## Total Pipeline Context
Reranker latency should be evaluated in the context of the total pipeline, not in isolation. A typical RAG pipeline takes 600ms to 3 seconds end-to-end, with LLM generation consuming 80 to 90 percent of that time. Adding 30ms of cognitive scoring increases total latency by 1 to 5 percent. Even adding 150ms of cross-encoder reranking only increases total latency by 5 to 25 percent.
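The arithmetic behind those percentages, assuming the reranker runs strictly in series with the rest of the pipeline:

```python
def overhead_pct(rerank_ms: float, pipeline_ms: float) -> float:
    """Reranking time as a percentage of the existing end-to-end latency."""
    return 100 * rerank_ms / pipeline_ms

# 30ms of cognitive scoring against a 600ms-3s pipeline: 1-5% added latency.
print(overhead_pct(30, 3000), overhead_pct(30, 600))    # 1.0 5.0
# 150ms of cross-encoder reranking: 5-25% added latency.
print(overhead_pct(150, 3000), overhead_pct(150, 600))  # 5.0 25.0
```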
The latency that matters is perceived latency: how long the user waits for a response. In streaming applications, where the LLM starts emitting tokens before the full response is complete, the first token typically appears within 200 to 500ms of the LLM call starting, so an extra 30 to 150ms of reranking before that call is barely perceptible. In non-streaming applications, the reranking time is folded into the total wait, which is dominated by generation time.
The fastest multi-factor reranking available: Adaptive Recall's cognitive scoring adds under 40ms per query with no model inference.