
Why Vector Search Fails on Exact Keywords

Vector search fails on exact-match queries because embedding models encode semantic meaning, not lexical identity. When a user searches for "ERR_CONN_TIMEOUT" or "v3.2.1" or "JIRA-4521," the embedding model produces a vector that represents the general concept (an error, a version, a ticket) but not the specific string. The document containing that exact string may have an embedding that focuses on the surrounding context rather than the identifier itself. This is the fundamental limitation that makes hybrid search necessary for production applications.

How Embeddings Handle Keywords

Embedding models process text through a tokenizer that splits input into subword tokens. The string "ERR_CONN_TIMEOUT" might be tokenized as ["ERR", "_", "CONN", "_", "TIMEOUT"] or even more fragmented depending on the tokenizer. Each token is mapped to a learned vector, and the model combines these token vectors into a single embedding for the full input. The problem is that the model has seen very few training examples containing "ERR_CONN_TIMEOUT" specifically, so it cannot learn that this particular combination of tokens has a specific, important meaning.

Compare this to how the model handles "database connection timeout." The model has seen thousands of training examples with these words in various combinations and contexts. It has learned rich associations: database relates to SQL, connections, queries, performance; timeout relates to latency, errors, configuration, waiting. The resulting embedding is information-dense and distinguishes this topic well from unrelated topics.

For "ERR_CONN_TIMEOUT," the model does its best: it recognizes "CONN" as related to "connection" and "TIMEOUT" as related to timeouts, so the embedding points in the general direction of connection timeout errors. But it lacks the specificity to distinguish "ERR_CONN_TIMEOUT" from "ERR_CONN_REFUSED" or "ERR_SOCKET_TIMEOUT" or any other connection error code. All of these produce similar embeddings because the model sees them as the same concept (connection error) expressed with different tokens.
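The collapse described above can be illustrated with a toy sketch. The per-token vectors below are invented purely to show the mechanics (a real model learns thousands of dimensions); the pooling step is a crude average, a stand-in for how models combine subword vectors:

```python
import math

# Hypothetical per-token vectors -- invented values, not a real model's weights.
TOKEN_VECS = {
    "ERR":     [0.9, 0.1, 0.0],
    "CONN":    [0.1, 0.9, 0.1],
    "TIMEOUT": [0.0, 0.2, 0.9],
    "REFUSED": [0.0, 0.3, 0.8],
}

def embed(tokens):
    """Average the token vectors -- a crude stand-in for subword pooling."""
    dims = len(next(iter(TOKEN_VECS.values())))
    summed = [sum(TOKEN_VECS[t][d] for t in tokens) for d in range(dims)]
    return [x / len(tokens) for x in summed]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

timeout = embed(["ERR", "CONN", "TIMEOUT"])
refused = embed(["ERR", "CONN", "REFUSED"])
print(f"{cosine(timeout, refused):.3f}")  # ≈ 0.997: the two codes collapse together
```

Because the two codes share two of three tokens and the remaining tokens point in similar directions, the pooled embeddings are nearly identical, which is exactly why vector similarity cannot tell them apart.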

The Five Failure Categories

Error codes and status codes. Searches for "HTTP 429," "ENOENT," "ORA-12154," or "SIGKILL" return documents about the general topic (HTTP errors, file system errors, Oracle errors, process signals) rather than documents specifically about that code. A document about HTTP 503 may rank higher than the document about HTTP 429 if the 503 document has richer context about HTTP errors in general.

Version numbers and identifiers. Searching for "React 18.2.0" returns documents about React generally, not the specific document that mentions version 18.2.0. The embedding model treats "18.2.0" as a weakly meaningful numeric string rather than a critical distinguishing identifier. Similarly, searching for a specific configuration parameter name, API endpoint path, or environment variable name produces embeddings focused on the surrounding context rather than the exact string.

Proper nouns and product names. Specialized product names, internal tool names, and code names that the embedding model has not seen in training data produce weak embeddings. If your internal deployment tool is called "Rocketship," the model does not know that "Rocketship" refers to your deployment infrastructure and may associate the embedding with actual rocketships or space travel.

Code patterns and syntax. Searching for a specific function signature like "def process_batch(items: List[dict], batch_size: int)" returns documents about batch processing generally rather than the specific function definition. The embedding captures the concept but not the exact syntax.

Abbreviations and acronyms. Domain-specific abbreviations ("K8s" for Kubernetes, "PG" for PostgreSQL, "tf" for Terraform) may not be recognized by the embedding model, especially if the abbreviation is ambiguous. "PG" could mean PostgreSQL, a movie rating, or a proper name, and the model produces a vague embedding that partially matches all interpretations.

Why Keyword Search Handles These Cases

Keyword search (BM25) performs exact string matching against an inverted index. It does not need to understand what "ERR_CONN_TIMEOUT" means. It just needs to find documents that contain that exact string. This is trivial for a keyword search engine: provided the analyzer preserves underscore-joined identifiers as single tokens (most default analyzers do), the inverted index maps the token "ERR_CONN_TIMEOUT" directly to every document that contains it, regardless of context or meaning.
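A minimal inverted index makes this concrete. The documents below are invented, and the whitespace tokenizer is a simplification (real engines use configurable analyzers), but the core lookup is just a hash-map hit:

```python
from collections import defaultdict

# Illustrative corpus -- assumes the tokenizer keeps "ERR_CONN_TIMEOUT" whole.
docs = {
    1: "Retry logic for ERR_CONN_TIMEOUT in the gateway service",
    2: "General guide to debugging connection issues",
    3: "Handling ERR_CONN_REFUSED during deploys",
}

# Build the inverted index: token -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

# Exact lookup: no semantics required, just a direct map from string to docs.
print(index["ERR_CONN_TIMEOUT"])  # {1}
```

Note that the lookup cannot confuse "ERR_CONN_TIMEOUT" with "ERR_CONN_REFUSED": they are different keys in the index, which is precisely the distinction the embedding model loses.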

This is why hybrid search (combining vector and keyword search) is the standard recommendation for production retrieval. In typical workloads, vector search handles the roughly 70 to 80% of queries that are semantic ("how to fix connection issues," "explain the deployment process"), and keyword search handles the remaining 20 to 30% that are exact-match ("ERR_CONN_TIMEOUT," "version 3.2.1"). Together, they cover both cases.
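One common way to combine the two result lists is Reciprocal Rank Fusion (RRF). The sketch below uses invented document IDs and rankings; the constant k=60 is the value commonly used in practice:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so a document ranked well by both vector and keyword search rises to the top.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_503", "doc_general", "doc_429"]  # semantic ranking (429 buried)
keyword_hits = ["doc_429"]                            # exact match on "HTTP 429"

fused = rrf([vector_hits, keyword_hits])
print(fused[0])  # doc_429 -- the exact-match hit wins despite its weak vector rank
```

Here the keyword signal rescues the exact-match document that vector similarity alone ranked last, which is the core value proposition of hybrid search.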

Mitigation Strategies Beyond Hybrid Search

Metadata fields for exact-match attributes. Extract error codes, version numbers, ticket IDs, and other exact-match identifiers into structured metadata fields during indexing. Then use metadata filtering to match these fields exactly before or alongside vector search. This is more reliable than keyword search for structured identifiers because it does not depend on tokenization.
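The extraction step can be as simple as a few regular expressions run at indexing time. The patterns and field names below are illustrative, not a fixed schema; real pipelines tune them to their own identifier conventions:

```python
import re

# Illustrative patterns for common exact-match identifiers.
ERROR_CODE = re.compile(r"\b[A-Z]+(?:_[A-Z]+)+\b")  # e.g. ERR_CONN_TIMEOUT
VERSION    = re.compile(r"\bv?\d+\.\d+\.\d+\b")     # e.g. v3.2.1
TICKET     = re.compile(r"\b[A-Z]{2,}-\d+\b")       # e.g. JIRA-4521

def extract_metadata(text):
    """Pull exact-match identifiers into structured fields during indexing."""
    return {
        "error_codes": ERROR_CODE.findall(text),
        "versions":    VERSION.findall(text),
        "tickets":     TICKET.findall(text),
    }

meta = extract_metadata("JIRA-4521: ERR_CONN_TIMEOUT after upgrading to v3.2.1")
print(meta)
```

At query time, the same patterns applied to the user's query tell you which metadata field to filter on before (or alongside) the vector search.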

Keyword expansion in embeddings. When embedding a document, prepend or append important keywords and identifiers to the text before embedding. A document about an error could be embedded as "ERR_CONN_TIMEOUT error connection timeout refused" followed by the original content. This forces the embedding to give more weight to the specific identifier.
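The expansion itself is a one-line text transformation applied before the embedding call. In this sketch, `embed_fn` is a placeholder for whatever embedding model you use; the identifiers are supplied by whatever extraction step your pipeline runs:

```python
# Prepend important identifiers so they carry more weight in the embedding.
def expand_for_embedding(text, identifiers):
    prefix = " ".join(identifiers)
    return f"{prefix} {text}" if prefix else text

doc = "The gateway retries three times before surfacing the error to clients."
expanded = expand_for_embedding(doc, ["ERR_CONN_TIMEOUT", "connection", "timeout"])
print(expanded)
# vector = embed_fn(expanded)  # embed the expanded text, not the original
```

Store the original text for display but index the vector of the expanded text, so the identifier influences retrieval without cluttering what the user reads.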

Custom embedding fine-tuning. Fine-tune the embedding model on your domain-specific data, including examples where specific identifiers should match specific documents. This teaches the model that "ERR_CONN_TIMEOUT" is distinct from "ERR_CONN_REFUSED." Fine-tuning requires training data and infrastructure but produces the best results for highly specialized domains.
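The training data for this typically takes the form of (query, positive, negative) triplets, the input format for a standard triplet or contrastive loss. The examples and field names below are illustrative:

```python
# Illustrative fine-tuning triplets: the negative is a near-miss identifier,
# which teaches the model to separate codes that tokenize almost identically.
triplets = [
    {
        "query":    "ERR_CONN_TIMEOUT",
        "positive": "ERR_CONN_TIMEOUT: the upstream did not respond in time.",
        "negative": "ERR_CONN_REFUSED: the upstream actively rejected the connection.",
    },
    {
        "query":    "HTTP 429",
        "positive": "HTTP 429 Too Many Requests: the client exceeded its rate limit.",
        "negative": "HTTP 503 Service Unavailable: the server is temporarily overloaded.",
    },
]

for t in triplets:
    assert t["query"] in t["positive"] and t["query"] not in t["negative"]
```

Hard negatives like these (near-identical identifiers) are what force the model to learn lexical distinctions it would otherwise ignore.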

Adaptive Recall addresses this limitation through multiple retrieval signals. Vector similarity handles semantic queries. The knowledge graph maps specific entities (error codes, service names, configurations) to their connections, so a search for "ERR_CONN_TIMEOUT" traverses the graph to the specific service and configuration that produces that error. Cognitive scoring boosts memories that have been accessed in the context of connection timeout debugging. Together, these signals find the right memory even when vector similarity alone would return generic results.

Adaptive Recall goes beyond vector similarity. Entity recognition, graph traversal, and cognitive scoring find what keyword and vector search miss individually.

Try It Free