How to build a document comparison tool with AI
Quick answer
Build a document comparison tool by converting documents into vector embeddings using a dedicated embedding model such as OpenAI's text-embedding-3-small, then compute similarity scores or highlight differences. Use vector databases or cosine similarity on embeddings to identify matching or differing content efficiently.
Prerequisites
- Python 3.8+
- An OpenAI API key
- pip install openai>=1.0
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.
pip install openai
export OPENAI_API_KEY="your-key-here"
Step by step
This example shows how to load two text documents, generate embeddings with the text-embedding-3-small model (chat models such as gpt-4o cannot be used with the embeddings endpoint), and compute cosine similarity to compare their content.
import os
import numpy as np
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Load documents
with open("doc1.txt", "r", encoding="utf-8") as f:
    doc1 = f.read()
with open("doc2.txt", "r", encoding="utf-8") as f:
    doc2 = f.read()

# Get embeddings
embedding1 = get_embedding(doc1)
embedding2 = get_embedding(doc2)

# Compute similarity
similarity_score = cosine_similarity(embedding1, embedding2)
print(f"Document similarity score: {similarity_score:.4f}")
Output
Document similarity score: 0.8723
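As a sanity check on the cosine similarity formula used above, here is a pure-Python equivalent (no NumPy) showing that identical vectors score 1.0 and orthogonal vectors score 0.0:

```python
import math

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

print(round(cosine_similarity([1.0, 2.0], [1.0, 2.0]), 6))  # 1.0 (same direction)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0 (orthogonal)
```

Real embedding vectors rarely hit these extremes; scores for unrelated documents typically land well above 0 because embedding spaces are not uniformly distributed.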
Common variations
- Use a larger embedding model such as text-embedding-3-large for potentially better semantic fidelity.
- Compare documents in chunks for large files to localize differences.
- Batch multiple texts into a single embeddings request, or use async calls, for performance improvements.
- Store embeddings in vector databases like FAISS or Chroma for scalable comparisons.
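The chunk-based comparison mentioned above can be sketched with a simple splitter; chunk_size and overlap are assumed parameters, and each chunk would then be passed to the get_embedding helper from the step-by-step example:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks so a changed
    passage only affects the few chunks that contain it."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# A 1200-character document yields three overlapping chunks; embedding
# each chunk separately lets you report which regions differ, not just
# a single whole-document score.
chunks = chunk_text("A" * 1200, chunk_size=500, overlap=50)
print(len(chunks))  # 3
```

Character-based chunking is the simplest approach; splitting on sentence or paragraph boundaries usually gives cleaner embeddings, at the cost of slightly more code.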
Troubleshooting
- If similarity scores are unexpectedly low, verify documents are preprocessed consistently (e.g., remove extra whitespace).
- Ensure the API key is correctly set in os.environ["OPENAI_API_KEY"].
- For large documents, split text into smaller chunks before embedding to avoid token limits (text-embedding-3 models accept roughly 8,000 tokens per input).
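Consistent preprocessing, as suggested above, can be as simple as collapsing whitespace and lowercasing before embedding. This normalize helper is an illustrative sketch, not part of any SDK:

```python
import re

def normalize(text):
    # Collapse runs of whitespace (spaces, tabs, newlines) into single
    # spaces and lowercase, so pure formatting differences between two
    # documents don't skew their embeddings.
    return re.sub(r"\s+", " ", text).strip().lower()

a = normalize("The  quick\n\nbrown   fox.")
b = normalize("the quick brown fox.")
print(a == b)  # True: the two variants normalize identically
```

Apply the same normalization to both documents before calling get_embedding; normalizing only one side reintroduces the inconsistency you were trying to remove.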
Key takeaways
- Use embeddings from a model like text-embedding-3-small to convert documents into comparable vectors.
- Calculate cosine similarity on embeddings to quantify document similarity efficiently.
- Chunk large documents and use vector databases for scalable, precise comparisons.
- Preprocess text consistently to improve embedding quality and comparison accuracy.