Cosine vs Dot Product vs Euclidean Distance
Cosine Similarity
Cosine similarity computes the cosine of the angle between two vectors. The formula is the dot product of the two vectors divided by the product of their magnitudes: cos(A, B) = (A . B) / (|A| * |B|). The result ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction). In practice, text embeddings rarely produce negative cosine similarities, so results typically range from 0 to 1.
The key property of cosine similarity is that it ignores vector magnitude. A short document and a long document about the same topic may produce embeddings with different magnitudes (longer text tends to produce slightly larger vectors in some models), but cosine similarity treats them as equally similar to a query about that topic because only the direction matters. This makes cosine similarity robust to variation in input text length.
Cosine distance, used by many vector databases, is simply 1 - cosine_similarity. It ranges from 0 (identical direction) to 2 (opposite direction), with lower values indicating more similar vectors. When a database uses cosine distance, you sort results in ascending order (lowest distance = most similar).
import numpy as np
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the magnitudes: (A . B) / (|A| * |B|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Example
vec_a = np.array([0.5, 0.3, 0.8, 0.1])
vec_b = np.array([0.4, 0.35, 0.75, 0.15])
vec_c = np.array([0.1, 0.9, 0.05, 0.7])
print(cosine_similarity(vec_a, vec_b)) # ~0.99 (very similar)
print(cosine_similarity(vec_a, vec_c)) # ~0.38 (different)
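To make the two properties above concrete, here is a short sketch reusing the cosine_similarity helper: scaling a vector changes its magnitude but leaves cosine similarity untouched, and subtracting the similarity from 1 gives the cosine distance form that databases sort in ascending order. The scaling factor and the cosine_distance helper are illustrative, not part of any particular library.

# Scaling a vector changes its magnitude but not its direction,
# so cosine similarity is unaffected.
print(cosine_similarity(vec_a, vec_b)) # ~0.99
print(cosine_similarity(10 * vec_a, vec_b)) # still ~0.99

# Cosine distance, as used by many vector databases: 1 - cosine similarity.
# Lower distance means more similar, so results are sorted ascending.
def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - cosine_similarity(a, b)

print(cosine_distance(vec_a, vec_b)) # ~0.01 (near-identical direction)
print(cosine_distance(vec_a, vec_c)) # ~0.62 (different direction)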
Dot Product (Inner Product)
The dot product multiplies corresponding elements of two vectors and sums the results: A . B = sum(a_i * b_i). Unlike cosine similarity, it is sensitive to both direction and magnitude. Doubling the magnitude of one vector doubles the dot product even though its direction is unchanged. The result is unbounded: it can be any real number, positive or negative.
Some embedding models intentionally use magnitude to encode additional information. A longer, more detailed document might produce a larger vector than a short, vague document, even if they discuss the same topic. In this case, dot product naturally ranks the detailed document higher because its larger magnitude contributes to a higher score. Cosine similarity would rank them equally because it ignores the magnitude difference.
If your embedding model's documentation recommends dot product (also called inner product or IP), use it. This signals that the model was trained with magnitude as a meaningful feature. If the model produces normalized vectors (magnitude close to 1.0 for all inputs), dot product and cosine similarity produce identical rankings.
def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Sum of element-wise products: A . B = sum(a_i * b_i)
    return np.dot(a, b)
# For normalized vectors, dot product equals cosine similarity
vec_a_norm = vec_a / np.linalg.norm(vec_a)
vec_b_norm = vec_b / np.linalg.norm(vec_b)
print(np.dot(vec_a_norm, vec_b_norm)) # same as cosine_similarity
print(cosine_similarity(vec_a_norm, vec_b_norm)) # identical
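To illustrate the ranking difference described above, the following sketch uses two hypothetical document vectors that point in the same direction but differ in magnitude, standing in for a detailed versus a vague document from a model that encodes detail in magnitude. Dot product prefers the larger vector while cosine similarity scores them identically; the vectors themselves are made up for illustration.

# Hypothetical embeddings: same direction, different magnitudes.
doc_vague = np.array([0.2, 0.1, 0.3, 0.05])
doc_detailed = 2.0 * doc_vague # same direction, twice the magnitude
query = np.array([0.25, 0.12, 0.28, 0.04])

# Cosine similarity ignores magnitude: both documents score the same.
print(cosine_similarity(query, doc_vague)) # ~0.99
print(cosine_similarity(query, doc_detailed)) # ~0.99 (identical)

# Dot product rewards magnitude: the detailed document scores twice as high.
print(dot_product(query, doc_vague)) # ~0.15
print(dot_product(query, doc_detailed)) # ~0.30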
Euclidean Distance (L2 Distance)
Euclidean distance measures the straight-line distance between two points in the vector space: dist(A, B) = sqrt(sum((a_i - b_i)^2)). Smaller distances mean more similar vectors. It is the most geometrically intuitive metric because it corresponds to our physical understanding of distance.
Euclidean distance is sensitive to both direction and magnitude, similar to dot product but inverted (closer = more similar instead of higher = more similar). For normalized vectors, Euclidean distance has a direct mathematical relationship to cosine similarity: dist^2 = 2 * (1 - cosine_similarity). This means sorting by Euclidean distance produces the same ranking as sorting by cosine similarity when vectors are normalized.
Euclidean distance is slightly faster to compute than cosine similarity because it does not require the magnitude normalization step. Some vector databases default to L2 distance for this reason, though the difference is negligible in practice because HNSW index traversal dominates query time, not individual distance calculations.
def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance: sqrt(sum((a_i - b_i)^2))
    return np.linalg.norm(a - b)
# For normalized vectors, ranking is identical to cosine
print(euclidean_distance(vec_a_norm, vec_b_norm)) # small = similar
print(euclidean_distance(vec_a_norm, vec_c / np.linalg.norm(vec_c))) # larger
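As a quick numerical check of the dist^2 = 2 * (1 - cosine_similarity) relationship stated above, using the normalized vectors already defined:

# For unit-length vectors, squared L2 distance and cosine similarity are linked directly.
vec_c_norm = vec_c / np.linalg.norm(vec_c)
dist_sq = euclidean_distance(vec_a_norm, vec_c_norm) ** 2
from_cosine = 2 * (1 - cosine_similarity(vec_a_norm, vec_c_norm))
print(np.isclose(dist_sq, from_cosine)) # True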
Which Metric to Use
Use cosine similarity as your default. It works correctly with both normalized and non-normalized vectors, is universally supported by vector databases, and is what most embedding model documentation assumes. If you are unsure which metric your model expects, cosine similarity is the safe choice.
Use dot product when your embedding model explicitly recommends it. This typically means the model encodes quality or specificity information in vector magnitude. Check the model's documentation or paper for this recommendation. OpenAI's text-embedding-3 family, Cohere embed-v4, and most open-source models produce normalized vectors where dot product and cosine are equivalent.
Use Euclidean distance when you need spatial relationships (clustering, nearest-neighbor graphs) or when your database defaults to it and switching is unnecessary. For normalized vectors, it produces the same ranking as cosine similarity.
Database Configuration
# pgvector distance operators:
# <=> cosine distance
# <#> negative inner product (for ORDER BY ascending)
# <-> L2 distance
# Pinecone: set at index creation
# metric: "cosine" | "dotproduct" | "euclidean"
# Qdrant: set at collection creation
# distance: "Cosine" | "Dot" | "Euclid"
# Weaviate: set in schema
# vectorIndexConfig.distance: "cosine" | "dot" | "l2-squared"
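As one concrete example of these settings in use, here is a sketch of a pgvector cosine-distance query issued from Python with psycopg2. The documents table, its embedding column, and the connection string are placeholders, not part of any specific setup.

import psycopg2

# Hypothetical schema: documents(id, content, embedding vector(4))
conn = psycopg2.connect("dbname=example") # placeholder connection string
cur = conn.cursor()

query_vec = "[0.5, 0.3, 0.8, 0.1]" # pgvector accepts a bracketed string literal

# <=> is cosine distance: lower = more similar, so ORDER BY ascending.
cur.execute(
    "SELECT id, content, embedding <=> %s::vector AS distance "
    "FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
    (query_vec, query_vec),
)
for row in cur.fetchall():
    print(row)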
Adaptive Recall handles distance metric selection as part of its managed vector infrastructure, combining similarity scores with cognitive activation and graph traversal for richer ranking.
Try It Free