How to Optimize Vector Search for Low Latency
Where Latency Comes From
A vector search query involves four latency components. First, embedding the query text: this is an API call or local model inference, typically 10 to 50 milliseconds for API-based models and 2 to 10 milliseconds for local models. Second, network round-trip to the vector database, typically 1 to 5 milliseconds within the same region. Third, the actual index traversal and distance calculations, typically 1 to 20 milliseconds for HNSW indexes on millions of vectors. Fourth, retrieving the associated metadata or document content for the matched vectors, typically 1 to 5 milliseconds.
Query embedding is often the largest component and the hardest to optimize because it depends on the model provider's infrastructure or your GPU hardware. The optimization steps below focus on the components you can control directly: index performance, distance calculation speed, and result retrieval.
Step-by-Step Optimization
Before optimizing, measure where time is actually spent. Log timestamps at each stage of the search pipeline: before and after embedding, before and after the database query, before and after result processing. Many teams optimize the wrong component because they assume where the bottleneck is rather than measuring it.
import time

# Assumes get_embedding, vector_db, and load_metadata are defined
# elsewhere in your pipeline; only the timing code is added here.
async def profiled_search(query: str, top_k: int = 10):
    timings = {}

    t0 = time.perf_counter()
    embedding = await get_embedding(query)
    timings["embed_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    results = await vector_db.search(embedding, top_k=top_k)
    timings["search_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    enriched = await load_metadata(results)
    timings["metadata_ms"] = (time.perf_counter() - t0) * 1000

    timings["total_ms"] = sum(timings.values())
    return enriched, timings
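Called from an async context, the timings dict makes the bottleneck obvious at a glance. A usage sketch (the query string and numbers are illustrative):

# From within an async function
enriched, timings = await profiled_search("how do I reset my password")
print(timings)
# e.g. {"embed_ms": 31.2, "search_ms": 4.6, "metadata_ms": 2.1, "total_ms": 37.9}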
The ef_search parameter controls how many candidates HNSW evaluates during a query. Higher values improve recall but increase latency roughly linearly. The default in most databases is 40 to 64. For sub-10-millisecond queries at the 99th percentile, keep ef_search between 40 and 100. For applications where recall matters more than latency, increase it to 200 or higher. Build your index with higher m (connections per node) and ef_construction values to improve the base graph quality, which lets you use a lower ef_search at query time.
-- pgvector: rebuild with higher quality index
DROP INDEX IF EXISTS idx_documents_embedding;
CREATE INDEX idx_documents_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 200);
-- Set search parameter for session
SET hnsw.ef_search = 64;
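Rather than guessing, you can sweep ef_search against your own data and pick the lowest value that meets your recall target. A minimal sketch using psycopg against the pgvector setup above; the helper name is ours, and query_vec is assumed to be a pgvector-formatted string such as "[0.1, 0.2, ...]":

import time
import psycopg

def sweep_ef_search(conn, query_vec, values=(40, 64, 100, 200), runs=20):
    # Measure mean query latency for several ef_search settings
    results = {}
    with conn.cursor() as cur:
        for ef in values:
            cur.execute(f"SET hnsw.ef_search = {ef}")
            for _ in range(3):  # warm-up runs so cold caches don't skew timings
                cur.execute(
                    "SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
                    (query_vec,),
                )
                cur.fetchall()
            t0 = time.perf_counter()
            for _ in range(runs):
                cur.execute(
                    "SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
                    (query_vec,),
                )
                cur.fetchall()
            results[ef] = (time.perf_counter() - t0) / runs * 1000  # mean ms
    return results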
# Qdrant: set HNSW params at collection creation
# PUT /collections/documents
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "hnsw_config": {
    "m": 24,
    "ef_construct": 200
  }
}
# Override ef at query time
# POST /collections/documents/points/search
{
  "vector": [...],
  "limit": 10,
  "params": {"hnsw_ef": 64}
}

Quantization reduces the memory footprint and speeds up distance calculations by representing vectors with fewer bits. Scalar quantization (float32 to int8) gives a 4x size reduction with minimal recall loss (typically under 1%). Product quantization (PQ) gives 8 to 16x reduction with 2 to 5% recall loss. Binary quantization gives 32x reduction but with significant recall degradation unless used as a first-pass filter followed by exact rescoring on the top candidates.
# Qdrant: enable scalar quantization
# PUT /collections/documents
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "quantization_config": {
    "scalar": {
      "type": "int8",
      "always_ram": true
    }
  }
}
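Binary quantization, mentioned above as a first-pass filter, needs query-time rescoring to stay accurate. In Qdrant this is a search parameter; a sketch, assuming the collection was created with a binary quantization_config rather than the scalar one above:

# POST /collections/documents/points/search
{
  "vector": [...],
  "limit": 10,
  "params": {
    "quantization": {
      "rescore": true,
      "oversampling": 2.0
    }
  }
}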
# Weaviate: enable PQ compression
# In schema config:
{
  "vectorIndexConfig": {
    "pq": {
      "enabled": true,
      "segments": 128,
      "centroids": 256
    }
  }
}

If your queries include metadata filters (by date, category, tenant, or any other attribute), apply those filters before the vector search rather than after. Pre-filtering reduces the candidate set that the HNSW index must search through, directly reducing query time. Most vector databases support pre-filtering natively through their query API.
-- pgvector: partial index for common filters
-- Note: index predicates must use immutable expressions, so now()
-- cannot appear here; use a literal cutoff (illustrative date below)
-- and rebuild the index on a schedule
CREATE INDEX idx_docs_recent_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WHERE created_at > '2026-01-01';

-- Query with a matching filter (the planner uses the partial index
-- when the query predicate implies the index predicate)
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE created_at > '2026-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 10;
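For the Qdrant request below, filtering is only cheap when the filter fields carry payload indexes; without them, filter evaluation falls back to scanning payloads. A sketch of creating the two indexes used in the example:

# PUT /collections/documents/index
{"field_name": "tenant_id", "field_schema": "keyword"}

# PUT /collections/documents/index
{"field_name": "created_at", "field_schema": "datetime"}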
# Qdrant: filter in search request
# POST /collections/documents/points/search
{
  "vector": [...],
  "limit": 10,
  "filter": {
    "must": [
      {"key": "tenant_id", "match": {"value": "acme"}},
      {"key": "created_at", "range": {"gte": "2026-01-01"}}
    ]
  }
}

The single most impactful hardware optimization is ensuring the HNSW index fits entirely in RAM. When the index spills to disk, query latency jumps from single-digit milliseconds to 50 to 200 milliseconds because each hop in the HNSW graph becomes a disk read instead of a memory access. Calculate your index size (roughly 2 to 4x the raw vector size) and provision enough RAM accordingly. For frequently repeated queries, cache the results for a short TTL to avoid redundant index traversal.
# Calculate memory requirements (example: 2M vectors at 1536 dims)
num_vectors = 2_000_000
dimensions = 1536
vector_bytes = num_vectors * dimensions * 4   # float32 = 4 bytes per value
index_overhead = vector_bytes * 3             # HNSW overhead ~3x
total_ram_needed = vector_bytes + index_overhead
# Raw: 2M * 1536 * 4 = ~11.5 GiB
# Index: 11.5 * 3 = ~34.5 GiB
# Total: ~46 GiB RAM recommended
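To verify the estimate against reality, pgvector users can ask Postgres for the index's actual on-disk size (which must fit in the page cache to stay memory-resident):

-- Check the on-disk size of the HNSW index
SELECT pg_size_pretty(pg_relation_size('idx_documents_embedding'));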
# Redis caching for repeated queries
import redis
import hashlib
import json

cache = redis.Redis()

def cached_search(query: str, top_k: int = 10, ttl: int = 300):
    # Include top_k in the key so different result sizes don't collide
    key_material = f"{query}:{top_k}".encode()
    cache_key = f"vsearch:{hashlib.sha256(key_material).hexdigest()}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    results = vector_search(query, top_k)  # your existing search function
    cache.setex(cache_key, ttl, json.dumps(results))
    return results

Latency Targets by Use Case
Interactive search (user types a query and waits): aim for under 200 milliseconds total, including embedding. At this latency, the search feels instantaneous.

RAG pipeline (vector search feeds context to an LLM): aim for under 100 milliseconds for the vector search step, since LLM generation will dominate total response time.

Agentic workflows (multiple searches per turn): aim for under 50 milliseconds per search, since the agent may run 3 to 5 searches per user interaction and total latency compounds.
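To keep yourself honest about these targets, track percentiles rather than averages. A small sketch that checks measured latency samples against the budgets above (the helper name and dict are ours; the thresholds mirror this section):

import statistics

# Budgets from the targets above, in milliseconds
BUDGETS_MS = {"interactive": 200, "rag": 100, "agentic": 50}

def within_budget(samples_ms: list[float], use_case: str) -> bool:
    # quantiles(n=100) returns 99 cut points; index 98 approximates p99
    p99 = statistics.quantiles(samples_ms, n=100)[98]
    return p99 <= BUDGETS_MS[use_case]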
Adaptive Recall handles vector search optimization as part of its managed infrastructure, so you can focus on your application while retrieval stays fast and accurate.