What is vector similarity search?
Vector similarity search uses vector embeddings to find the most similar items based on distance metrics such as cosine similarity or Euclidean distance. It enables fast retrieval of semantically related data in applications such as semantic search and recommendation systems.
How it works
Vector similarity search works by representing data items as numerical vectors in a high-dimensional space, often generated by AI models like embeddings. It then measures the distance or similarity between these vectors using metrics such as cosine similarity or Euclidean distance. The closer two vectors are, the more semantically similar the underlying data items are. This process is analogous to finding the nearest points on a map, where each point represents an item’s meaning.
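To make the metric concrete, here is a minimal sketch of cosine similarity on two-dimensional toy vectors (values chosen purely for illustration):

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 0.0])  # points "east"
b = np.array([2.0, 0.0])  # same direction, different length
c = np.array([0.0, 1.0])  # orthogonal direction

print(cosine_similarity(a, b))  # 1.0 (identical direction; magnitude is ignored)
print(cosine_similarity(a, c))  # 0.0 (unrelated directions)
```

Because cosine similarity normalizes by vector length, it compares direction only, which is why it is a popular default for embeddings.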
Concrete example
Here is a Python example using the OpenAI SDK to perform vector similarity search by embedding queries and comparing cosine similarity scores:
```python
import os

import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_embedding(text):
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors; higher means more similar."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Sample documents
documents = [
    "The cat sits on the mat.",
    "Dogs are great pets.",
    "Artificial intelligence and machine learning.",
    "The quick brown fox jumps over the lazy dog."
]

# Embed documents
doc_embeddings = [get_embedding(doc) for doc in documents]

# Query
query = "Pets and animals"
query_embedding = get_embedding(query)

# Compute similarity scores
scores = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]

# Find most similar document
most_similar_idx = np.argmax(scores)
print(f"Most similar document: {documents[most_similar_idx]} with score {scores[most_similar_idx]:.4f}")
```

Example output (exact scores vary with the embedding model):

```
Most similar document: Dogs are great pets. with score 0.87
```
When to use it
Use vector similarity search when you need to find semantically related items beyond exact keyword matches, such as in semantic search, recommendation engines, image or text retrieval, and clustering. Avoid it when your data is strictly categorical or when exact matches are required, as vector search focuses on meaning and similarity rather than exact equality.
Key terms
| Term | Definition |
|---|---|
| Vector embedding | A numerical representation of data capturing semantic meaning in a high-dimensional space. |
| Cosine similarity | A metric measuring the cosine of the angle between two vectors, indicating similarity. |
| Euclidean distance | The straight-line distance between two points (vectors) in space. |
| Semantic search | Search that retrieves results based on meaning rather than exact keyword matches. |
| Nearest neighbor search | Finding the closest vectors to a query vector in a dataset. |
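The nearest-neighbor lookup and both distance metrics from the table can be sketched in a few lines of plain NumPy. This is a brute-force illustration on hypothetical 3-D vectors, not a production index:

```python
import numpy as np

# Toy 3-D "embeddings" (values chosen only for illustration)
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.2, 0.1],
])
query = np.array([1.0, 0.0, 0.0])

# Euclidean distance: smaller means closer
dists = np.linalg.norm(vectors - query, axis=1)

# Cosine similarity: larger means more similar
sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

print(np.argmin(dists), np.argmax(sims))  # both pick index 0 here
```

For large datasets, this exhaustive scan is replaced by approximate nearest-neighbor indexes, but the comparison logic is the same.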
Key takeaways
- Vector similarity search uses vector embeddings and distance metrics to find semantically related items.
- It is essential for AI applications like semantic search, recommendations, and clustering.
- Cosine similarity is the most common metric for measuring vector closeness in high-dimensional spaces.