How to search legal documents with AI
Quick answer
Use an embedding model such as text-embedding-3-small to convert legal documents into vectors, then store them in a vector index such as FAISS. To search, embed the query text with the same model and retrieve the most relevant documents via similarity search, using the OpenAI embeddings API or another provider.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0 faiss-cpu
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
- Install OpenAI Python SDK and FAISS for vector search.
- Set OPENAI_API_KEY in your environment.

pip install openai faiss-cpu

output

Collecting openai
Collecting faiss-cpu
Successfully installed openai-1.x faiss-cpu-1.x
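One way to set the key is to export it in your shell before running the examples (the key below is a placeholder; on Windows, use `set` or `setx` instead of `export`):

```shell
# Replace the placeholder with your own key from the OpenAI dashboard
export OPENAI_API_KEY="sk-..."
```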
Step by step
This example shows how to embed legal documents, store them in FAISS, and query them with a search phrase.
import os
from openai import OpenAI
import faiss
import numpy as np

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample legal documents
documents = [
    "The contract shall be governed by the laws of California.",
    "All disputes will be resolved through arbitration.",
    "The tenant must provide a 30-day notice before vacating.",
    "Confidential information must not be disclosed to third parties."
]

# Embed documents
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
)
embeddings = [data.embedding for data in response.data]

# Convert to a float32 numpy array (FAISS requires float32)
embedding_dim = len(embeddings[0])
embeddings_np = np.array(embeddings).astype('float32')

# Create FAISS index (exact L2-distance search)
index = faiss.IndexFlatL2(embedding_dim)
index.add(embeddings_np)

# Embed the query with the same model
query = "What are the rules for ending a lease?"
query_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query]
)
query_embedding = np.array(query_response.data[0].embedding).astype('float32')

# Search for the top 2 most similar documents
k = 2
D, I = index.search(np.array([query_embedding]), k)

print("Top matches:")
for idx in I[0]:
    print(f"- {documents[idx]}")

output

Top matches:
- The tenant must provide a 30-day notice before vacating.
- The contract shall be governed by the laws of California.
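IndexFlatL2 ranks by Euclidean distance. Cosine similarity is a common alternative for text embeddings; once vectors are normalized to unit length, it reduces to a plain dot product. A minimal numpy-only sketch of that ranking step, using made-up 4-dimensional vectors as stand-ins for real embeddings:

```python
import numpy as np

# Toy vectors standing in for real embedding output
docs = np.array([
    [0.1, 0.9, 0.2, 0.0],
    [0.8, 0.1, 0.1, 0.3],
    [0.2, 0.8, 0.3, 0.1],
], dtype="float32")
query = np.array([0.15, 0.85, 0.25, 0.05], dtype="float32")

# Normalize rows so the dot product equals cosine similarity
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

scores = docs_n @ query_n           # one cosine score per document
ranking = np.argsort(-scores)       # highest similarity first
print("ranking:", ranking.tolist())
```

In FAISS, the equivalent is to normalize the vectors and use IndexFlatIP (inner product) instead of IndexFlatL2.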
Common variations
For large-scale applications, asynchronous calls with the OpenAI SDK improve throughput, and streaming results suits interactive search interfaces. Alternative vector databases such as Chroma or Pinecone can replace FAISS when you need persistence or cloud scalability, and different embedding models, including ones fine-tuned on legal text, can improve accuracy.
import asyncio
import os
from openai import AsyncOpenAI

async def async_search():
    # AsyncOpenAI exposes the same methods as OpenAI, but awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    documents = [
        "The contract shall be governed by the laws of California.",
        "All disputes will be resolved through arbitration.",
        "The tenant must provide a 30-day notice before vacating.",
        "Confidential information must not be disclosed to third parties."
    ]
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=documents
    )
    embeddings = [data.embedding for data in response.data]
    print(f"Embedded {len(embeddings)} documents asynchronously.")

asyncio.run(async_search())

output

Embedded 4 documents asynchronously.
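For large collections, a practical pattern is to embed in batches, since a single embeddings call only accepts a limited amount of input. A sketch of just the batching logic (the batch size of 4 is arbitrary; real limits depend on the model's input constraints):

```python
def batched(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in documents; in practice each batch would be passed as the
# input argument of a client.embeddings.create call
documents = [f"Clause {i}" for i in range(10)]
batch_sizes = [len(batch) for batch in batched(documents, 4)]
print(batch_sizes)  # [4, 4, 2]
```

With the async client, several such batch calls can be issued concurrently via asyncio.gather.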
Troubleshooting
- Empty search results: Ensure your query and documents are embedded with the same model and vector dimension.
- API errors: Check your OPENAI_API_KEY environment variable and network connectivity.
- FAISS installation issues: Use faiss-cpu for CPU-only environments, or install faiss-gpu if a GPU is available.
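The dimension mismatch in the first point can be caught programmatically before searching. A numpy-only sketch, where the random vectors stand in for real embeddings (1536 is the default output dimension of text-embedding-3-small):

```python
import numpy as np

# Stand-ins for real embeddings; shapes are what matters here
doc_vecs = np.random.rand(4, 1536).astype("float32")
query_vec = np.random.rand(1536).astype("float32")

# A mismatch here usually means the query and documents were
# embedded with different models
if query_vec.shape[0] != doc_vecs.shape[1]:
    raise ValueError("query/document embedding dimensions differ")
print("dimensions match:", doc_vecs.shape[1])
```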
Key Takeaways
- Convert legal documents and queries into vector embeddings using text-embedding-3-small.
- Use a vector index like FAISS to perform similarity search efficiently.
- Keep embedding models consistent between documents and queries for accurate results.
- Async API calls improve throughput for large document collections.
- Troubleshoot by verifying API keys, model consistency, and vector dimensions.