How to build a document comparison tool with AI
Quick answer
Build a document comparison tool by converting documents into vector embeddings using a dedicated embedding model such as OpenAI's text-embedding-3-small, then compute similarity scores or highlight differences. Use vector databases or cosine similarity on embeddings to identify matching or differing content efficiently.
Prerequisites
- Python 3.8+
- An OpenAI API key
- pip install openai>=1.0
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.
pip install openai
export OPENAI_API_KEY="your-key-here"
Step by step
This example shows how to load two text documents, generate embeddings with the text-embedding-3-small model (chat models such as gpt-4o cannot be used with the embeddings endpoint), and compute cosine similarity to compare their content.
import os
import numpy as np
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Load documents
with open("doc1.txt", "r", encoding="utf-8") as f:
    doc1 = f.read()
with open("doc2.txt", "r", encoding="utf-8") as f:
    doc2 = f.read()

# Get embeddings
embedding1 = get_embedding(doc1)
embedding2 = get_embedding(doc2)

# Compute similarity
similarity_score = cosine_similarity(embedding1, embedding2)
print(f"Document similarity score: {similarity_score:.4f}")
Output
Document similarity score: 0.8723
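As a sanity check on the cosine similarity formula used above, here is a pure-Python equivalent (no NumPy) showing that identical vectors score 1.0 and orthogonal vectors score 0.0:

```python
import math

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

print(round(cosine_similarity([1.0, 2.0], [1.0, 2.0]), 6))  # 1.0 (same direction)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0 (orthogonal)
```

Real embedding vectors rarely hit these extremes; scores for unrelated documents typically land well above 0 because embedding spaces are not uniformly distributed.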
Common variations
- Use a larger embedding model such as text-embedding-3-large for potentially better semantic fidelity.
- Compare documents in chunks for large files to localize differences.
- Batch multiple texts into a single embeddings request, or use async calls, for performance improvements.
- Store embeddings in vector databases like FAISS or Chroma for scalable comparisons.
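The chunk-based comparison mentioned above can be sketched with a simple splitter; chunk_size and overlap are assumed parameters, and each chunk would then be passed to the get_embedding helper from the step-by-step example:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks so a changed
    passage only affects the few chunks that contain it."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# A 1200-character document yields three overlapping chunks; embedding
# each chunk separately lets you report which regions differ, not just
# a single whole-document score.
chunks = chunk_text("A" * 1200, chunk_size=500, overlap=50)
print(len(chunks))  # 3
```

Character-based chunking is the simplest approach; splitting on sentence or paragraph boundaries usually gives cleaner embeddings, at the cost of slightly more code.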
Troubleshooting
- If similarity scores are unexpectedly low, verify documents are preprocessed consistently (e.g., remove extra whitespace).
- Ensure the API key is correctly set in os.environ["OPENAI_API_KEY"].
- For large documents, split text into smaller chunks before embedding to avoid token limits (text-embedding-3 models accept roughly 8,000 tokens per input).
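Consistent preprocessing, as suggested above, can be as simple as collapsing whitespace and lowercasing before embedding. This normalize helper is an illustrative sketch, not part of any SDK:

```python
import re

def normalize(text):
    # Collapse runs of whitespace (spaces, tabs, newlines) into single
    # spaces and lowercase, so pure formatting differences between two
    # documents don't skew their embeddings.
    return re.sub(r"\s+", " ", text).strip().lower()

a = normalize("The  quick\n\nbrown   fox.")
b = normalize("the quick brown fox.")
print(a == b)  # True: the two variants normalize identically
```

Apply the same normalization to both documents before calling get_embedding; normalizing only one side reintroduces the inconsistency you were trying to remove.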
Key takeaways
- Use embeddings from a model like text-embedding-3-small to convert documents into comparable vectors.
- Calculate cosine similarity on embeddings to quantify document similarity efficiently.
- Chunk large documents and use vector databases for scalable, precise comparisons.
- Preprocess text consistently to improve embedding quality and comparison accuracy.