How to do similarity search with embeddings
Quick answer
Use a model like `text-embedding-3-small` to convert text into vector embeddings, then compute similarity (e.g., cosine similarity) between vectors to find the closest matches. Store embeddings in a vector database or in-memory structure for efficient retrieval.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- `pip install openai>=1.0`
- `pip install numpy`
Setup
Install the openai Python package and set your API key as an environment variable. Also install numpy for vector math.
```bash
pip install openai numpy
```

Step by step
This example shows how to embed a list of documents, embed a query, and find the most similar document using cosine similarity.
```python
import os
import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence and machine learning are fascinating.",
    "OpenAI provides powerful language models."
]

# Embed documents
doc_embeddings = [get_embedding(doc) for doc in documents]

# Query
query = "Tell me about AI and models"
query_embedding = get_embedding(query)

# Compute similarities
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]

# Find best match
best_idx = np.argmax(similarities)
print(f"Most similar document: {documents[best_idx]}")
```

Output

```
Most similar document: Artificial intelligence and machine learning are fascinating.
```
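Once you have more than a handful of documents, the per-document loop above can be collapsed into a single matrix operation. The sketch below is plain NumPy, independent of the API calls (the `rank_documents` helper and the toy 2-D vectors are illustrative, standing in for real embeddings): it stacks the document vectors, normalizes everything once, and scores every document against the query with one matrix-vector product.

```python
import numpy as np

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (indices of the top_k most similar documents, all cosine scores)."""
    docs = np.vstack(doc_vecs)                                  # shape: (n_docs, dim)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length rows
    q = query_vec / np.linalg.norm(query_vec)
    scores = docs @ q                                           # cosine per document
    return np.argsort(scores)[::-1][:top_k], scores

# Toy 2-D vectors stand in for real embeddings
doc_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
query_vec = np.array([0.9, 0.1])

top, scores = rank_documents(query_vec, doc_vecs, top_k=2)
print(top)  # indices of the best matches, most similar first
```

This scales to thousands of documents before a dedicated vector database becomes necessary.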
Common variations
- Use vector databases like `FAISS` or `Chroma` for scalable similarity search.
- Use async calls if embedding large batches.
- Try different embedding models like `text-embedding-3-large` for higher quality.
- Use other similarity metrics like Euclidean distance if preferred.
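To illustrate the last variation: with Euclidean distance the *smallest* value is the best match, so ranking is by ascending distance rather than descending similarity. A minimal sketch with toy vectors (note that OpenAI embeddings are normalized to unit length, in which case cosine and Euclidean orderings coincide):

```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

query = np.array([1.0, 0.0])
doc_a = np.array([0.0, 1.0])   # far from the query
doc_b = np.array([0.9, 0.1])   # close to the query

dists = [euclidean_distance(query, v) for v in (doc_a, doc_b)]
best = int(np.argmin(dists))   # smallest distance wins
```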
Troubleshooting
- If embeddings are slow, batch inputs to reduce API calls.
- If similarity scores are low, verify text preprocessing (e.g., lowercase, remove punctuation).
- Check your API key and environment variable if authentication errors occur.
- Ensure `numpy` is installed for vector math.
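On the batching point: the embeddings endpoint accepts a list of strings in a single request, so one call can embed a whole batch instead of one string at a time. A minimal sketch, where `chunked` is an illustrative helper and the commented-out lines assume the `client` from the main example:

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage with the client from the main example:
# embeddings = []
# for batch in chunked(documents, 100):
#     response = client.embeddings.create(model="text-embedding-3-small", input=batch)
#     embeddings.extend(np.array(item.embedding) for item in response.data)

batches = list(chunked(["a", "b", "c", "d", "e"], 2))
```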
Key Takeaways
- Convert text to embeddings using `text-embedding-3-small` for similarity search.
- Compute cosine similarity between query and document embeddings to find closest matches.
- Use vector databases like FAISS for large-scale, efficient similarity search.
- Batch embedding requests to optimize API usage and speed.
- Preprocess text consistently to improve embedding quality and similarity accuracy.