Why Cosine Similarity Alone Returns Bad Results
What Cosine Similarity Actually Measures
Cosine similarity computes the cosine of the angle between two vectors in high-dimensional space. When text is converted to embedding vectors using a model like OpenAI's text-embedding-3 or Cohere's embed-v3, semantically similar texts produce vectors that point in similar directions, yielding cosine similarity values close to 1.0. Dissimilar texts produce vectors pointing in different directions, yielding values closer to 0.
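To ground the definition, here is a minimal sketch of the computation using NumPy. The toy vectors are stand-ins; real embedding models produce hundreds or thousands of dimensions, but the arithmetic is identical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" -- illustrative only, not real model output.
query = np.array([0.9, 0.1, 0.3, 0.0])
doc_similar = np.array([0.8, 0.2, 0.4, 0.1])    # points in a similar direction
doc_unrelated = np.array([0.0, 0.9, 0.0, 0.8])  # points elsewhere

print(cosine_similarity(query, doc_similar))    # close to 1.0
print(cosine_similarity(query, doc_unrelated))  # much lower
```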
This is genuinely useful. It captures synonymy ("automobile" and "car" have similar embeddings), paraphrase ("how to fix a bug" and "debugging techniques" score highly), and topical relevance ("machine learning" and "neural networks" are related). For simple document retrieval, where you want to find documents about the same topic as the query, cosine similarity works well.
The problem is that AI memory systems need more than topical relevance. A memory system serves as the persistent knowledge layer for an AI application, and the quality of its retrieval directly determines the quality of the AI's responses. Getting the "most similar" text is not the same as getting the "best answer," and the gap between these two things grows as the memory store accumulates data over time.
Five Failure Modes of Cosine-Only Retrieval
1. Stale Information Ranks Equally
Cosine similarity is time-blind. A product specification from two years ago has the same similarity score to a current query as an updated specification from last week. If the old spec uses similar vocabulary (which it almost certainly does, since it describes the same product), it may even score slightly higher if it is more comprehensive or uses more keyword-rich language.
In practice, this means retrieval systems that grow over time develop an increasing noise problem. Every stored version of a fact competes equally for retrieval, and the user gets a mix of current and outdated information with no way to distinguish between them from the ranking alone. Manual solutions (deleting old versions, tagging content with expiration dates, maintaining version numbers) require ongoing maintenance that scales linearly with the volume of stored data.
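To see how even a simple time signal changes the ranking, here is a hedged sketch that multiplies similarity by an exponential recency weight. The half-life value and the memory records are illustrative assumptions, and this is a deliberately simpler stand-in for the activation-based approach described later.

```python
import time

def recency_weight(stored_at: float, now: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: a memory half_life_days old is weighted 0.5."""
    age_days = (now - stored_at) / 86400
    return 0.5 ** (age_days / half_life_days)

# Two versions of the same spec with near-identical similarity to the query.
now = time.time()
old_spec = {"similarity": 0.82, "stored_at": now - 730 * 86400}  # ~2 years old
new_spec = {"similarity": 0.81, "stored_at": now - 7 * 86400}    # last week

for m in (old_spec, new_spec):
    m["score"] = m["similarity"] * recency_weight(m["stored_at"], now)

# The fresh spec now outranks the stale one despite slightly lower similarity.
print(old_spec["score"], new_spec["score"])
```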
2. Vocabulary Mismatches Cause Misses
While cosine similarity handles synonyms better than keyword search, it still struggles with vocabulary mismatches between how information is stored and how it is queried. A developer asking "why are users getting 401 errors" has only moderate cosine similarity to a memory that says "we rotated the JWT signing key on Tuesday," even though that memory is the likely answer. The connection between 401 errors and JWT signing keys requires domain knowledge that is not captured in the embedding vectors.
This is the gap that spreading activation fills. If both "authentication" and "JWT" appear as entities in the knowledge graph, querying about authentication errors activates JWT-related memories through entity connections, regardless of text similarity. But cosine-only retrieval has no mechanism for this kind of contextual bridging.
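A minimal sketch of one-hop spreading activation over an entity graph follows. The dictionary-based graph and the fixed decay factor are illustrative assumptions, not Adaptive Recall's actual data model.

```python
# Entity graph: which entities are linked, and which memories mention them.
ENTITY_LINKS = {
    "authentication": ["JWT", "401 errors"],
    "JWT": ["authentication", "signing key"],
}
MEMORIES_BY_ENTITY = {
    "JWT": ["we rotated the JWT signing key on Tuesday"],
    "authentication": ["auth service runs on port 8443"],
}

def spread_activation(query_entities: list[str], decay: float = 0.5) -> dict:
    """Push activation from query entities to their graph neighbors."""
    activation = {}
    for entity in query_entities:
        activation[entity] = max(activation.get(entity, 0.0), 1.0)
        for neighbor in ENTITY_LINKS.get(entity, []):
            # One hop away: activation arrives attenuated by the decay factor.
            activation[neighbor] = max(activation.get(neighbor, 0.0), decay)
    return activation

# A query about authentication errors activates "JWT" through the graph,
# surfacing the signing-key memory despite weak text overlap.
for entity, level in spread_activation(["authentication"]).items():
    for memory in MEMORIES_BY_ENTITY.get(entity, []):
        print(f"{level:.2f}  {memory}")
```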
3. No Distinction Between Reliable and Unreliable Information
A user casually mentioning "I think we use PostgreSQL 14" and a deployment runbook documenting "PostgreSQL 15.3, deployed 2026-01-15, connection pool: PgBouncer" score nearly identically against a query about database configuration. Cosine similarity cannot distinguish a verified, authoritative source from an off-hand remark because it only measures text similarity, not information quality.
Confidence scoring addresses this by tracking how well-established each memory is. The deployment runbook, corroborated by multiple subsequent memories about PostgreSQL 15 connections, accumulates high confidence. The casual remark, never corroborated and eventually contradicted, loses confidence. Retrieval ranks the runbook higher because its confidence multiplier is larger.
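One simple way to model this dynamic is an exponential-moving-average update that moves confidence toward 1.0 on corroboration and toward 0.0 on contradiction. The update rule and learning rate below are illustrative assumptions, not the product's actual algorithm.

```python
def update_confidence(confidence: float, corroborated: bool, rate: float = 0.2) -> float:
    """Nudge confidence toward 1.0 on corroboration, toward 0.0 on contradiction."""
    target = 1.0 if corroborated else 0.0
    return confidence + rate * (target - confidence)

runbook = 0.6  # starting confidence for the deployment runbook
remark = 0.6   # starting confidence for the casual remark

for _ in range(3):  # three later memories mention PostgreSQL 15 connections
    runbook = update_confidence(runbook, corroborated=True)
remark = update_confidence(remark, corroborated=False)  # contradicted once

print(f"runbook: {runbook:.2f}, remark: {remark:.2f}")
# runbook: 0.80, remark: 0.48 -- the runbook earns the larger multiplier
```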
4. Usage Patterns Are Invisible
Some information is retrieved frequently because it is genuinely useful. A coding pattern, a configuration value, or a decision rationale that gets referenced repeatedly demonstrates its value through usage. Cosine similarity ignores this signal entirely. A memory retrieved 50 times ranks the same as a memory never retrieved, as long as their text is equally similar to the query.
Base-level activation captures this. Memories that are retrieved frequently accumulate access events that boost their activation. The coding pattern referenced every day has far higher activation than the one-off observation from three months ago, so it ranks higher even when both have identical cosine similarity to the query.
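Base-level activation is usually formulated as in the ACT-R cognitive architecture: the log of a sum of power-law-decayed access events. The sketch below assumes that standard formulation and the conventional decay of 0.5; whether any particular system uses these exact parameters is an assumption.

```python
import math

def base_level_activation(access_ages_days: list[float], decay: float = 0.5) -> float:
    """ACT-R style base-level activation: ln of summed power-law-decayed accesses.
    Each access contributes age**-decay, so frequent and recent use both raise it."""
    return math.log(sum(age ** -decay for age in access_ages_days))

daily_pattern = base_level_activation([1, 2, 3, 4, 5, 6, 7])  # used every day
one_off = base_level_activation([90])                         # seen once, 90 days ago

print(f"daily: {daily_pattern:.2f}, one-off: {one_off:.2f}")
# daily: ~1.39, one-off: ~-2.25 -- the daily pattern dominates the ranking
```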
5. No Self-Curation
Systems that use only cosine similarity have no mechanism for managing their own content quality. Every piece of stored information has equal standing forever. The system cannot identify redundant memories, outdated facts, contradictory statements, or low-value noise. All of these problems must be managed externally through manual curation, automated cleanup scripts, or explicit deletion policies.
A cognitive scoring system self-curates through its activation dynamics. Unused memories decay below the retrieval threshold. Contradicted memories lose confidence. Redundant memories are identified and consolidated through the reflect process. The system maintains retrieval quality automatically, which means it scales to large memory stores without requiring proportionally more maintenance effort.
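As an illustration of threshold-based self-curation, the sketch below simply filters out memories whose activation has decayed below a retrieval cutoff. The records and the threshold value are hypothetical.

```python
# Hypothetical memory records with activation already computed.
memories = [
    {"text": "current API rate limits", "activation": 1.2},
    {"text": "obsolete staging URL",    "activation": -3.1},
    {"text": "deploy checklist",        "activation": 0.4},
]

RETRIEVAL_THRESHOLD = -2.0  # illustrative cutoff, not a recommended value

# Memories below the threshold simply stop appearing in results; no manual
# deletion pass is needed because decay pushes unused items under the line.
retrievable = [m for m in memories if m["activation"] >= RETRIEVAL_THRESHOLD]
print([m["text"] for m in retrievable])
```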
What Cosine Similarity Is Good For
Despite these limitations, cosine similarity is not something to discard. It excels at its core job: finding text that is semantically similar to a query. In a cognitive scoring pipeline, cosine similarity is the first stage that narrows the candidate set from the entire memory store to a manageable number of relevant items. The other scoring dimensions (activation, spreading activation, confidence) then rerank these candidates to produce the final result set.
The key insight is that cosine similarity is necessary but not sufficient. It answers "what is this text about?" but not "is this the best answer right now?" For AI memory systems that need to provide the best answer, the additional dimensions of cognitive scoring are what transform a text similarity engine into an intelligent retrieval system.
The Combined Approach
Adaptive Recall uses cosine similarity as one of four scoring dimensions. Vector similarity typically receives 40% weight in the combined score, with base-level activation at 30%, spreading activation at 20%, and confidence at 10%. This means semantic relevance is still the dominant factor, but recency, contextual connections, and reliability all influence the final ranking. The result is retrieval that finds semantically relevant results and then orders them by how useful, current, and trustworthy they are.
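Combining the stated weights into a single score might look like the sketch below. The assumption that each dimension is normalized to [0, 1] before weighting is mine; only the 40/30/20/10 split comes from the text.

```python
WEIGHTS = {"similarity": 0.4, "activation": 0.3, "spreading": 0.2, "confidence": 0.1}

def combined_score(candidate: dict) -> float:
    """Weighted sum of the four dimensions, each assumed normalized to [0, 1]."""
    return sum(WEIGHTS[dim] * candidate[dim] for dim in WEIGHTS)

# Stage 1 (cosine similarity) has already narrowed the store to candidates;
# stage 2 reranks them with the full score.
candidates = [
    {"text": "old spec", "similarity": 0.84, "activation": 0.10, "spreading": 0.2, "confidence": 0.3},
    {"text": "new spec", "similarity": 0.81, "activation": 0.90, "spreading": 0.6, "confidence": 0.9},
]
for c in sorted(candidates, key=combined_score, reverse=True):
    print(f"{combined_score(c):.3f}  {c['text']}")
# The new spec wins despite slightly lower similarity, because the other
# three dimensions carry the signals cosine similarity cannot see.
```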
Go beyond cosine similarity. Adaptive Recall adds recency, frequency, contextual connections, and confidence to every retrieval call.
Get Started Free