How to find similar documents using OpenAI embeddings
Quick answer
Use OpenAI's text-embedding-3-large model to generate vector embeddings for your documents, then store these embeddings in a vector database like FAISS. Query similarity by embedding the input text and retrieving nearest neighbors based on cosine similarity.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0 faiss-cpu numpy
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
- Install packages: pip install openai faiss-cpu numpy
- Set the API key as an environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)

Step by step
This example shows how to embed a list of documents, store them in a FAISS index, and then find the most similar documents to a query.
import os
import numpy as np
import faiss
from openai import OpenAI
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Sample documents
documents = [
"The Eiffel Tower is located in Paris.",
"The Great Wall of China is visible from space.",
"Python is a popular programming language.",
"OpenAI develops advanced AI models.",
"The Statue of Liberty is in New York City."
]
# Function to get embeddings for a list of texts
def get_embeddings(texts):
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts
    )
    embeddings = [data.embedding for data in response.data]
    return np.array(embeddings).astype("float32")
# Generate embeddings for documents
doc_embeddings = get_embeddings(documents)
# Create FAISS index (cosine similarity via normalized vectors)
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
# Normalize embeddings for cosine similarity
faiss.normalize_L2(doc_embeddings)
index.add(doc_embeddings)
# Query to find similar documents
query = "Where is the Eiffel Tower located?"
query_embedding = get_embeddings([query])
faiss.normalize_L2(query_embedding)
# Search top 3 similar documents
k = 3
distances, indices = index.search(query_embedding, k)
print("Query:", query)
print("Top similar documents:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {documents[idx]} (score: {distances[0][i]:.4f})")

Output
Query: Where is the Eiffel Tower located?
Top similar documents:
1. The Eiffel Tower is located in Paris. (score: 0.9987)
2. The Statue of Liberty is in New York City. (score: 0.7892)
3. The Great Wall of China is visible from space. (score: 0.6543)
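Why normalize before using IndexFlatIP? Because the inner product of two L2-normalized vectors equals their cosine similarity, so a plain inner-product index returns cosine scores. A quick numpy-only sanity check of that equivalence:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype="float32")
b = np.array([2.0, 0.0, 1.0], dtype="float32")

# Cosine similarity from the definition
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product of the L2-normalized vectors
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
ip = float(a_n @ b_n)

print(abs(cosine - ip) < 1e-6)  # → True
```

This is also why a normalized vector scores exactly 1.0 against itself, as in the top result above.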
Common variations
- Use text-embedding-3-small for lower cost and latency, at some loss of retrieval quality. (Chat models such as gpt-4o do not produce embeddings.)
- Use async calls with asyncio for batch embedding.
- Replace FAISS with other vector stores like Chroma or Weaviate for persistence and scalability.
- Adjust k in the search to return more or fewer results.
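For batch embedding, note that the embeddings endpoint accepts a list of inputs per request (the documented cap is currently 2,048 items), so large corpora are typically split into chunks and each chunk sent as one API call. A minimal, API-free sketch of the chunking step; the chunk helper is illustrative, not part of the openai library:

```python
def chunk(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Each batch below would be passed as `input` to one embeddings.create call,
# e.g. from an asyncio.gather over AsyncOpenAI requests.
batches = chunk([f"doc {i}" for i in range(10)], size=4)
print([len(b) for b in batches])  # → [4, 4, 2]
```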
Troubleshooting
- If embedding calls fail or return empty results, verify your API key and network connection.
- Ensure input texts are not empty and do not exceed the model's input limit (8,191 tokens for text-embedding-3-large).
- If FAISS index search returns no results, check that embeddings are normalized for cosine similarity.
- For large datasets, consider approximate nearest neighbor indexes like IndexIVFFlat in FAISS.
Key Takeaways
- Generate embeddings with OpenAI's text-embedding-3-large model for accurate vector representations.
- Use FAISS to efficiently index and search document embeddings by cosine similarity.
- Normalize embeddings before adding to FAISS for correct similarity scoring.
- Adjust the number of neighbors k to control how many similar documents you retrieve.
- For production, consider persistent vector stores and batch embedding for scalability.