Understanding Distance Metrics in Vector Databases
In the realm of vector databases and Retrieval-Augmented Generation (RAG) systems, understanding how to measure the 'closeness' or 'similarity' between data points is paramount. This is where distance metrics come into play. They are the mathematical tools that quantify the difference between two vectors, enabling efficient search and retrieval of relevant information.
What are Distance Metrics?
Distance metrics, along with their counterparts, similarity measures, are functions that quantify how far apart or how alike two data points are. In the context of vector embeddings, these points are numerical representations of text, images, or other data. A smaller distance generally implies greater similarity between the vectors.
Distance metrics quantify the 'difference' between vectors.
These mathematical functions help us understand how alike or unlike two data points are when represented as vectors. A lower distance score typically means the vectors are more similar.
The choice of distance metric can significantly impact the performance and relevance of search results in vector databases. Different metrics are sensitive to different aspects of vector relationships, making some more suitable for specific types of data or similarity definitions.
Commonly Used Distance Metrics
Several distance metrics are widely used in vector databases. Each has its own mathematical formulation and is suited for different scenarios.
| Metric | Description | Formula (Simplified) | Use Case |
|---|---|---|---|
| Cosine Similarity | Measures the cosine of the angle between two vectors; insensitive to vector magnitude, focusing on orientation. | cos(θ) = (A · B) / (‖A‖ ‖B‖) | Text similarity, document analysis, recommendation systems. |
| Euclidean Distance | The straight-line distance between two points in Euclidean space; sensitive to vector magnitude. | √Σ(aᵢ − bᵢ)² | Clustering, image similarity, general-purpose similarity. |
| Manhattan Distance (L1 Norm) | The sum of the absolute differences of the coordinates; also known as taxicab distance. | Σ\|aᵢ − bᵢ\| | Feature selection, scenarios where magnitude differences are important. |
| Dot Product | The projection of one vector onto another; related to cosine similarity but also considers magnitude. | A · B | When both direction and magnitude matter, often used in neural networks. |
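To make the table concrete, here is a minimal sketch that computes all four metrics on a pair of toy vectors, assuming NumPy and SciPy are available. Note that SciPy's `cosine` function returns the cosine *distance* (1 − similarity), which the sketch converts back.

```python
# A quick numeric check of the four metrics in the table, using NumPy and
# SciPy (both assumed to be installed). Vectors a and b are illustrative.
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

# SciPy's `cosine` returns the cosine distance (1 - similarity), so convert back.
cosine_sim = 1.0 - distance.cosine(a, b)

euclidean = distance.euclidean(a, b)  # L2: sqrt(sum((a_i - b_i)^2))
manhattan = distance.cityblock(a, b)  # L1: sum(|a_i - b_i|)
dot = np.dot(a, b)                    # unnormalized dot product

print(f"cosine similarity:  {cosine_sim:.4f}")
print(f"euclidean distance: {euclidean:.4f}")
print(f"manhattan distance: {manhattan:.4f}")
print(f"dot product:        {dot:.4f}")
```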
Cosine Similarity
Cosine similarity is perhaps the most popular metric for text embeddings. It calculates the cosine of the angle between two vectors. A value of 1 means the vectors are identical in direction, 0 means they are orthogonal (unrelated), and -1 means they are diametrically opposed. Its strength lies in its ability to ignore the magnitude of the vectors, focusing solely on their orientation, which is often more indicative of semantic similarity in text.
Imagine two arrows originating from the same point. Cosine similarity measures how much these arrows point in the same direction. If they point exactly the same way, the similarity is 1. If they are at a 90-degree angle, they are unrelated (similarity 0). If they point in opposite directions, they are completely dissimilar (similarity -1). This is particularly useful for text because the 'length' of a word embedding vector can sometimes be influenced by factors like word frequency, which we might want to ignore when assessing semantic meaning.
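As a quick illustration of that magnitude-invariance, here is a from-scratch cosine similarity in NumPy (a sketch with made-up vectors, not a library implementation); scaling one vector by any positive constant leaves the score unchanged.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (A . B) / (||A|| ||B||), matching the formula in the table
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.1])

print(cosine_similarity(a, b))         # close to 1.0: nearly the same direction
print(cosine_similarity(a, 10.0 * b))  # identical score: magnitude is ignored
```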
Euclidean Distance
Euclidean distance, often called the L2 distance because it is the L2 norm of the difference between two vectors, calculates the shortest straight-line distance between two points in a multi-dimensional space. It's the most intuitive distance measure, akin to using a ruler. However, it is sensitive to the magnitude of the vectors: if one vector is much longer than another, even if they point in a similar direction, the Euclidean distance will be large, potentially indicating less similarity than cosine similarity would.
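The contrast with cosine similarity is easy to demonstrate; in this sketch (toy vectors, NumPy assumed), a vector and a scaled copy of itself have cosine similarity exactly 1 but a large Euclidean distance.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a  # same direction as a, ten times the magnitude

euclidean = np.linalg.norm(a - b)  # L2 norm of the difference
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)   # ~33.67: large, purely because of the magnitude gap
print(cosine_sim)  # 1.0: identical direction
```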
Manhattan Distance (L1 Norm)
Manhattan distance, also known as L1 norm or taxicab distance, calculates the distance by summing the absolute differences of the coordinates. Imagine navigating a city grid where you can only move along streets (horizontally or vertically). This metric can be more robust to outliers than Euclidean distance and is useful when the absolute differences along each dimension are meaningful.
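The outlier behavior can be seen by comparing L1 and L2 on two difference patterns; this sketch (made-up vectors) contrasts many small coordinate differences with a single large one.

```python
import numpy as np

origin = np.zeros(10)
many_small = np.ones(10)  # differs from the origin by 1 in all 10 dimensions
one_outlier = np.zeros(10)
one_outlier[0] = 10.0     # differs by 10 in a single dimension

for name, v in [("many small diffs", many_small), ("one outlier", one_outlier)]:
    l1 = np.abs(origin - v).sum()    # Manhattan: sum of absolute differences
    l2 = np.linalg.norm(origin - v)  # Euclidean: square, sum, square root
    print(f"{name}: L1={l1:.2f}, L2={l2:.2f}")

# Both vectors are L1-distance 10 from the origin, but under L2 the outlier
# vector is far more distant (10.00 vs ~3.16): squaring lets one large
# coordinate difference dominate.
```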
Dot Product
The dot product is a fundamental operation in linear algebra. When applied to normalized vectors (vectors with a magnitude of 1), it is equivalent to cosine similarity. However, when applied to unnormalized vectors, it considers both the angle between them and their magnitudes. A larger dot product generally indicates greater similarity, especially when magnitudes are also large.
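This relationship is straightforward to verify numerically; in the sketch below (toy vectors, NumPy assumed), normalizing both vectors to unit length makes the plain dot product coincide with cosine similarity.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize to unit length; the dot product of unit vectors is the cosine
# of the angle between them.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

print(np.dot(a, b))          # raw dot product: depends on both magnitudes
print(np.dot(a_hat, b_hat))  # matches cosine_sim exactly
print(cosine_sim)
```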
Choosing the Right Metric
The selection of a distance metric depends heavily on the nature of the data and the specific task. For semantic similarity in text embeddings, cosine similarity is often preferred. For tasks where the absolute difference in feature values is critical, Euclidean or Manhattan distance might be more appropriate. Experimentation and understanding the properties of your embeddings are key to making the right choice.
In RAG systems, the choice of distance metric directly influences which documents are retrieved and therefore what context is provided to the LLM. A mismatch can lead to irrelevant or incomplete context.
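As a toy illustration of that effect, the sketch below (made-up 2-D vectors standing in for document embeddings) ranks the same three documents against the same query under cosine similarity and under Euclidean distance, and the two metrics disagree.

```python
import numpy as np

docs = {
    "doc_a": np.array([1.0, 1.0]),    # same direction as the query, small magnitude
    "doc_b": np.array([10.0, 10.0]),  # same direction, large magnitude
    "doc_c": np.array([3.0, 0.5]),    # different direction
}
query = np.array([2.0, 2.0])

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

by_cosine = sorted(docs, key=lambda d: cosine_sim(query, docs[d]), reverse=True)
by_euclidean = sorted(docs, key=lambda d: np.linalg.norm(query - docs[d]))

print("cosine:   ", by_cosine)     # ['doc_a', 'doc_b', 'doc_c']: a and b tie at 1.0
print("euclidean:", by_euclidean)  # ['doc_a', 'doc_c', 'doc_b']: b's magnitude pushes it last
```

Since many vector databases let you configure the metric per collection or index, it is worth evaluating retrieval quality under each candidate metric before committing to one.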