
How to Optimize Vector Search for Low Latency

Vector search latency in production typically ranges from 5 to 50 milliseconds for the database query alone, depending on index type, vector count, and hardware. Most of that time is spent in HNSW graph traversal and distance calculations, not network overhead. This guide covers the concrete optimization steps that bring query latency down, from tuning index parameters to applying quantization and caching.

Where Latency Comes From

A vector search query involves four latency components. First, embedding the query text: this is an API call or local model inference, typically 10 to 50 milliseconds for API-based models and 2 to 10 milliseconds for local models. Second, network round-trip to the vector database, typically 1 to 5 milliseconds within the same region. Third, the actual index traversal and distance calculations, typically 1 to 20 milliseconds for HNSW indexes on millions of vectors. Fourth, retrieving the associated metadata or document content for the matched vectors, typically 1 to 5 milliseconds.
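To make those numbers concrete, the component ranges above add up to a rough per-query budget. The sketch below simply restates the ranges from this section for an API-based embedding model; it is an illustration, not a measurement from any specific deployment.

# Rough per-query latency budget using the typical ranges above (API-based embedding).
budget_ms = {
    "embed":    (10, 50),  # query embedding via an API call
    "network":  (1, 5),    # round-trip to the vector database
    "index":    (1, 20),   # HNSW traversal and distance calculations
    "metadata": (1, 5),    # loading documents for the matched vectors
}

best_case = sum(low for low, _ in budget_ms.values())     # 13 ms
worst_case = sum(high for _, high in budget_ms.values())  # 80 ms
print(f"Expected total: {best_case}-{worst_case} ms per query")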

Query embedding is often the largest component and the hardest to optimize because it depends on the model provider's infrastructure or your GPU hardware. The optimization steps below focus on the components you can control directly: index performance, distance calculation speed, and result retrieval.

Step-by-Step Optimization

Step 1: Profile your current latency.
Before optimizing, measure where time is actually spent. Log timestamps at each stage of the search pipeline: before and after embedding, before and after the database query, before and after result processing. Many teams optimize the wrong component because they assume the bottleneck without measuring.
import time

async def profiled_search(query: str, top_k: int = 10):
    timings = {}

    # Stage 1: embed the query
    t0 = time.perf_counter()
    embedding = await get_embedding(query)
    timings["embed_ms"] = (time.perf_counter() - t0) * 1000

    # Stage 2: vector index search
    t0 = time.perf_counter()
    results = await vector_db.search(embedding, top_k=top_k)
    timings["search_ms"] = (time.perf_counter() - t0) * 1000

    # Stage 3: load metadata / document content for the matches
    t0 = time.perf_counter()
    enriched = await load_metadata(results)
    timings["metadata_ms"] = (time.perf_counter() - t0) * 1000

    timings["total_ms"] = sum(timings.values())
    return enriched, timings
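A minimal usage sketch, assuming an asyncio entry point and that get_embedding, vector_db, and load_metadata are already defined in your application; the query string is just an example:

import asyncio

async def main():
    results, timings = await profiled_search("how do I rotate api keys", top_k=10)
    # Ship the timings dict to your logging or metrics system on every request.
    print(timings)  # per-stage latencies in milliseconds

asyncio.run(main())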
Step 2: Tune HNSW parameters.
The ef_search parameter controls how many candidates HNSW evaluates during a query. Higher values improve recall but increase latency linearly. The default in most databases is 40 to 64. For sub-10ms queries at 99th percentile, keep ef_search between 40 and 100. For applications where recall matters more than latency, increase to 200 or higher. Build your index with higher m (connections per node) and ef_construction values to improve the base quality, which lets you use a lower ef_search at query time.
-- pgvector: rebuild with higher quality index
DROP INDEX IF EXISTS idx_documents_embedding;
CREATE INDEX idx_documents_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 200);

-- Set search parameter for session
SET hnsw.ef_search = 64;

# Qdrant: set HNSW params at collection creation
# PUT /collections/documents
{
  "vectors": { "size": 1536, "distance": "Cosine" },
  "hnsw_config": { "m": 24, "ef_construct": 200 }
}

# Override ef at query time
# POST /collections/documents/points/search
{
  "vector": [...],
  "limit": 10,
  "params": { "hnsw_ef": 64 }
}
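To pick the lowest ef_search that still meets your recall target, sweep the parameter over a sample of real queries and compare the approximate results against an exact search. A minimal sketch, assuming a Qdrant collection named "documents" and the qdrant-client package; adapt the client calls to your own database:

import time
import statistics
from qdrant_client import QdrantClient
from qdrant_client.models import SearchParams

client = QdrantClient(url="http://localhost:6333")

def sweep_ef(query_vectors, ef_values=(40, 64, 100, 200), top_k=10):
    """Report p95 latency and recall@top_k for each candidate ef_search value."""
    for ef in ef_values:
        latencies, recalls = [], []
        for vec in query_vectors:
            # Ground truth from an exact (brute-force) search.
            exact = client.search(
                collection_name="documents", query_vector=vec, limit=top_k,
                search_params=SearchParams(exact=True),
            )
            exact_ids = {p.id for p in exact}

            t0 = time.perf_counter()
            approx = client.search(
                collection_name="documents", query_vector=vec, limit=top_k,
                search_params=SearchParams(hnsw_ef=ef),
            )
            latencies.append((time.perf_counter() - t0) * 1000)
            recalls.append(len({p.id for p in approx} & exact_ids) / top_k)

        p95 = statistics.quantiles(latencies, n=20)[-1]
        print(f"hnsw_ef={ef}: p95={p95:.1f} ms, recall@{top_k}={statistics.mean(recalls):.3f}")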
Step 3: Apply vector quantization.
Quantization reduces the memory footprint and speeds up distance calculations by representing vectors with fewer bits. Scalar quantization (float32 to int8) gives a 4x size reduction with minimal recall loss (typically under 1%). Product quantization (PQ) gives 8 to 16x reduction with 2 to 5% recall loss. Binary quantization gives 32x reduction but with significant recall degradation unless used as a first-pass filter followed by exact rescoring on the top candidates.
# Qdrant: enable scalar quantization
# PUT /collections/documents
{
  "vectors": { "size": 1536, "distance": "Cosine" },
  "quantization_config": {
    "scalar": { "type": "int8", "always_ram": true }
  }
}

# Weaviate: enable PQ compression
# In schema config:
{
  "vectorIndexConfig": {
    "pq": { "enabled": true, "segments": 128, "centroids": 256 }
  }
}
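The binary quantization pattern described above, a coarse first pass followed by exact rescoring of the top candidates, can be sketched in plain NumPy. This illustrates the idea only; it is not how any particular database implements it, and the oversample factor is an assumption you would tune:

import numpy as np

def binary_search_with_rescore(query, vectors, top_k=10, oversample=4):
    """Coarse pass over 1-bit vectors, then exact rescoring of the survivors."""
    # 1-bit representation: keep only the sign of each dimension. Hamming distance
    # on these bits is a cheap approximation of angular distance.
    binary_db = vectors > 0
    binary_q = query > 0

    # First pass: keep top_k * oversample candidates by Hamming distance.
    hamming = np.count_nonzero(binary_db != binary_q, axis=1)
    candidates = np.argsort(hamming)[: top_k * oversample]

    # Second pass: exact cosine similarity on the surviving candidates only.
    cand = vectors[candidates]
    sims = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-sims)[:top_k]]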
Step 4: Implement pre-filtering.
If your queries include metadata filters (by date, category, tenant, or any other attribute), apply those filters before the vector search rather than after. Pre-filtering reduces the candidate set that the HNSW index must search through, directly reducing query time. Most vector databases support pre-filtering natively through their query API.
-- pgvector: partial index for common filters
-- (index predicates must be immutable, so use a fixed cutoff date and rebuild
--  the index periodically rather than filtering on now())
CREATE INDEX idx_docs_recent_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WHERE created_at > '2026-01-01';

-- Query with a matching filter (the planner uses the partial index)
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE created_at > '2026-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 10;

# Qdrant: filter in search request
# POST /collections/documents/points/search
{
  "vector": [...],
  "limit": 10,
  "filter": {
    "must": [
      { "key": "tenant_id", "match": { "value": "acme" } },
      { "key": "created_at", "range": { "gte": "2026-01-01" } }
    ]
  }
}
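The same pre-filtered search through the Python client, assuming the qdrant-client package and that query_embedding already holds the query vector:

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="documents",
    query_vector=query_embedding,  # the query's embedding, computed earlier
    query_filter=Filter(
        must=[FieldCondition(key="tenant_id", match=MatchValue(value="acme"))]
    ),
    limit=10,
)

Passing the filter inside the search request lets the database apply it during traversal rather than post-filtering results in your application.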
Step 5: Optimize hardware and caching.
The single most impactful hardware optimization is ensuring the HNSW index fits entirely in RAM. When the index spills to disk, query latency jumps from single-digit milliseconds to 50 to 200 milliseconds because each hop in the HNSW graph becomes a disk read instead of a memory access. Calculate your index size (roughly 2 to 4x the raw vector size) and provision enough RAM accordingly. For frequently repeated queries, cache the results for a short TTL to avoid redundant index traversal.
# Calculate memory requirements
num_vectors = 2_000_000   # example: 2M vectors
dimensions = 1536
vector_bytes = num_vectors * dimensions * 4  # float32
index_overhead = vector_bytes * 3            # HNSW overhead ~3x
total_ram_needed = vector_bytes + index_overhead
# Raw:   2M * 1536 * 4 bytes = ~11.5 GB
# Index: ~11.5 GB * 3 = ~34.5 GB
# Total: ~46 GB RAM recommended

# Redis caching for repeated queries
import redis
import hashlib
import json

cache = redis.Redis()

def cached_search(query: str, top_k: int = 10, ttl: int = 300):
    # Key on both the query and top_k so different result counts don't collide.
    cache_key = f"vsearch:{top_k}:{hashlib.sha256(query.encode()).hexdigest()}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    results = vector_search(query, top_k)
    cache.setex(cache_key, ttl, json.dumps(results))
    return results

Latency Targets by Use Case

Interactive search (user types a query and waits): aim for under 200 milliseconds total including embedding. At this latency, the search feels instantaneous.

RAG pipeline (vector search feeds context to an LLM): aim for under 100 milliseconds for the vector search step, since LLM generation will dominate total response time.

Agentic workflows (multiple searches per turn): aim for under 50 milliseconds per search, since the agent may run 3 to 5 searches per user interaction and total latency compounds.
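A small helper can check measured latencies against these targets using the per-stage timings collected in Step 1. The target values below are the ones stated above; the function name and percentile choice are illustrative:

import statistics

# Targets in milliseconds: total including embedding for interactive search,
# vector search step only for RAG and agentic workflows.
TARGETS_MS = {"interactive": 200, "rag": 100, "agentic": 50}

def meets_target(latencies_ms: list[float], use_case: str) -> bool:
    """Return True if the 95th-percentile latency is within the target."""
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]
    print(f"{use_case}: p95 = {p95:.1f} ms (target {TARGETS_MS[use_case]} ms)")
    return p95 <= TARGETS_MS[use_case]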

Adaptive Recall handles vector search optimization as part of its managed infrastructure, so you can focus on your application while retrieval stays fast and accurate.

Try It Free