How to find similar documents using OpenAI embeddings
Quick answer
Use OpenAI's text-embedding-3-large model to generate vector embeddings for your documents, then store these embeddings in a vector database like FAISS. Query similarity by embedding the input text and retrieving nearest neighbors based on cosine similarity.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0 faiss-cpu numpy
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
- Install packages: pip install openai faiss-cpu numpy
- Set the API key as an environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)

Step by step
This example shows how to embed a list of documents, store them in a FAISS index, and then find the most similar documents to a query.
import os
import numpy as np
import faiss
from openai import OpenAI
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Sample documents
documents = [
"The Eiffel Tower is located in Paris.",
"The Great Wall of China is visible from space.",
"Python is a popular programming language.",
"OpenAI develops advanced AI models.",
"The Statue of Liberty is in New York City."
]
# Function to get embeddings for a list of texts
def get_embeddings(texts):
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts
    )
    embeddings = [data.embedding for data in response.data]
    return np.array(embeddings).astype("float32")
# Generate embeddings for documents
doc_embeddings = get_embeddings(documents)
# Create FAISS index (cosine similarity via normalized vectors)
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
# Normalize embeddings for cosine similarity
faiss.normalize_L2(doc_embeddings)
index.add(doc_embeddings)
# Query to find similar documents
query = "Where is the Eiffel Tower located?"
query_embedding = get_embeddings([query])
faiss.normalize_L2(query_embedding)
# Search top 3 similar documents
k = 3
distances, indices = index.search(query_embedding, k)
print("Query:", query)
print("Top similar documents:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {documents[idx]} (score: {distances[0][i]:.4f})")

Output
Query: Where is the Eiffel Tower located?
Top similar documents:
1. The Eiffel Tower is located in Paris. (score: 0.9987)
2. The Statue of Liberty is in New York City. (score: 0.7892)
3. The Great Wall of China is visible from space. (score: 0.6543)
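Why normalize before using IndexFlatIP? Because the inner product of two L2-normalized vectors equals their cosine similarity, so a plain inner-product index returns cosine scores. A quick numpy-only sanity check of that equivalence:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype="float32")
b = np.array([2.0, 0.0, 1.0], dtype="float32")

# Cosine similarity from the definition
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product of the L2-normalized vectors
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
ip = float(a_n @ b_n)

print(abs(cosine - ip) < 1e-6)  # → True
```

This is also why a normalized vector scores exactly 1.0 against itself, as in the top result above.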
Common variations
- Use text-embedding-3-small for lower cost and latency, at some loss of retrieval quality. (Chat models such as gpt-4o do not produce embeddings.)
- Use async calls with asyncio for batch embedding.
- Replace FAISS with other vector stores like Chroma or Weaviate for persistence and scalability.
- Adjust k in the search to return more or fewer results.
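For batch embedding, note that the embeddings endpoint accepts a list of inputs per request (the documented cap is currently 2,048 items), so large corpora are typically split into chunks and each chunk sent as one API call. A minimal, API-free sketch of the chunking step; the chunk helper is illustrative, not part of the openai library:

```python
def chunk(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Each batch below would be passed as `input` to one embeddings.create call,
# e.g. from an asyncio.gather over AsyncOpenAI requests.
batches = chunk([f"doc {i}" for i in range(10)], size=4)
print([len(b) for b in batches])  # → [4, 4, 2]
```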
Troubleshooting
- If embedding calls fail or return empty results, verify your API key and network connection.
- Ensure input texts are not empty and do not exceed the model's input limit (8,191 tokens for text-embedding-3-large).
- If FAISS index search returns no results, check that embeddings are normalized for cosine similarity.
- For large datasets, consider approximate nearest neighbor indexes like IndexIVFFlat in FAISS.
Key Takeaways
- Generate embeddings with OpenAI's text-embedding-3-large model for accurate vector representations.
- Use FAISS to efficiently index and search document embeddings by cosine similarity.
- Normalize embeddings before adding to FAISS for correct similarity scoring.
- Adjust the number of neighbors k to control how many similar documents you retrieve.
- For production, consider persistent vector stores and batch embedding for scalability.