How to Intermediate · 4 min read

How to build an AI-powered document classifier

Quick answer
Build an AI-powered document classifier by embedding documents with OpenAIEmbeddings, indexing them in a vector store like FAISS, and querying with a gpt-4o chat model for classification. Use semantic search to match new documents to labeled examples and classify accordingly.

Prerequisites

  • Python 3.9+ (recent LangChain releases no longer support Python 3.8)
  • OpenAI API key (free tier works)
  • pip install openai langchain langchain-openai langchain_community faiss-cpu

Set up environment

Install required Python packages and set your OpenAI API key as an environment variable.

bash
pip install openai langchain langchain-openai langchain_community faiss-cpu
export OPENAI_API_KEY="your-api-key"

Step-by-step document classifier

This example shows how to embed documents, build a vector index, and classify new documents by semantic similarity using gpt-4o.

python
import os
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate

# Sample labeled documents
documents = [
    {"text": "Invoice from Acme Corp for $500", "label": "Invoice"},
    {"text": "Meeting notes from project kickoff", "label": "Notes"},
    {"text": "Annual financial report 2025", "label": "Report"}
]

# Extract texts and labels
texts = [doc["text"] for doc in documents]
labels = [doc["label"] for doc in documents]

# Initialize embeddings and vector store, keeping each label as metadata
embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
vector_store = FAISS.from_texts(
    texts, embeddings, metadatas=[{"label": label} for label in labels]
)

# Initialize chat model
chat = ChatOpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])

# Function to classify a new document
def classify_document(text):
    # Embed the query and retrieve the most similar labeled example
    docs = vector_store.similarity_search(text, k=1)
    if not docs:
        return "Unknown"
    # Pass the closest example and its label to the model as context
    example = docs[0].page_content
    example_label = docs[0].metadata["label"]
    prompt_template = """
You are a document classifier. The known categories are: {categories}

Example document (category: {example_label}):
{example}

Classify the following document into one of the known categories:
{query}

Answer with only the category label.
"""
    prompt = ChatPromptTemplate.from_template(prompt_template)
    messages = prompt.format_messages(
        categories=", ".join(sorted(set(labels))),
        example_label=example_label,
        example=example,
        query=text,
    )
    # invoke() replaces the deprecated direct call chat(messages)
    response = chat.invoke(messages)
    return response.content.strip()

# Test classification
new_doc = "Summary of quarterly earnings and expenses"
label = classify_document(new_doc)
print(f"Document: {new_doc}\nClassified as: {label}")

output
Document: Summary of quarterly earnings and expenses
Classified as: Report
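
To make the retrieval step less of a black box, here is a minimal sketch of nearest-neighbor lookup with a majority vote over the k closest labels. It is an illustration, not what FAISS does internally: the three-dimensional vectors are made up, cosine similarity is an assumed metric (the vector store may use a different distance), and the helper names nearest_labels and majority_label are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_labels(query_vec, indexed, k=1):
    """Return the labels of the k indexed vectors most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    return [label for _, label in ranked[:k]]

def majority_label(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

# Toy "embeddings" invented for illustration; real embeddings have
# hundreds or thousands of dimensions.
toy_index = [
    ([0.9, 0.1, 0.0], "Invoice"),
    ([0.1, 0.9, 0.1], "Notes"),
    ([0.2, 0.1, 0.9], "Report"),
    ([0.3, 0.0, 0.8], "Report"),
]
query = [0.25, 0.05, 0.85]

top = nearest_labels(query, toy_index, k=3)
print(majority_label(top))
```

With several examples per category, voting over k neighbors tends to be less sensitive to a single misleading match than relying on k=1.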

Common variations

  • Swap in another chat model, such as claude-3-5-sonnet-20241022 through ChatAnthropic from the langchain-anthropic package, and compare accuracy on your own documents.
  • Implement async calls with asyncio for high throughput.
  • Use streaming responses for real-time classification feedback.
  • Swap FAISS with Chroma or other vector stores depending on scale and persistence needs.
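
The asyncio variation could look roughly like the sketch below. To keep it runnable without an API key, classify_async is a hypothetical stand-in that fakes a label after a short delay; in a real pipeline it would await the chat model's async method instead (LangChain chat models expose ainvoke). The max_concurrency value is an assumption and should be tuned to your rate limits.

```python
import asyncio

async def classify_async(text):
    """Hypothetical stand-in for an async model call; returns a fake label."""
    await asyncio.sleep(0.01)  # simulates network latency
    return "Report" if "earnings" in text.lower() else "Notes"

async def classify_many(texts, max_concurrency=5):
    """Classify texts concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text):
        # The semaphore caps how many calls are in flight at once
        async with semaphore:
            return await classify_async(text)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(t) for t in texts))

batch = ["Summary of quarterly earnings", "Kickoff meeting agenda"]
results = asyncio.run(classify_many(batch))
print(list(zip(batch, results)))
```

Bounding concurrency with a semaphore keeps throughput high while reducing the chance of hitting API rate limits.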

Troubleshooting tips

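  • If classification returns "Unknown", the similarity search found no documents, so confirm the index was built from a non-empty list of labeled examples. If labels are wrong, increase k in similarity search or add more labeled examples.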
  • Ensure your OpenAI API key is set correctly in os.environ["OPENAI_API_KEY"].
  • Check for rate limits or quota errors from the API and handle retries gracefully.
  • Validate that document texts are clean and representative of categories.
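
One way to handle retries is exponential backoff, sketched below under stated assumptions. The helper call_with_retries and the FakeRateLimitError class are hypothetical and exist only for this demonstration; with the OpenAI SDK the exception to retry on would likely be openai.RateLimitError, and the delays shown are placeholders.

```python
import random
import time

def call_with_retries(func, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; small jitter avoids synchronized retries
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)

class FakeRateLimitError(Exception):
    """Hypothetical error used only to simulate a rate limit."""

attempts = {"count": 0}

def flaky_classify():
    # Fails twice, then succeeds, to exercise the retry path
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("rate limited")
    return "Invoice"

result = call_with_retries(
    flaky_classify, base_delay=0.01, retry_on=(FakeRateLimitError,)
)
print(result, attempts["count"])
```

In practice the wrapped function would be something like lambda: classify_document(new_doc), with retry_on narrowed to the errors that are actually transient.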

Key Takeaways

  • Use vector embeddings and semantic search to match new documents to labeled examples for classification.
  • Leverage gpt-4o or claude-3-5-sonnet-20241022 chat models to interpret and label documents contextually.
  • Start with a small labeled dataset and expand it to improve classification accuracy.
  • Choose vector stores like FAISS or Chroma based on your scale and persistence needs.
  • Handle API errors and environment setup carefully to ensure smooth classification workflows.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022