How to Intermediate · 4 min read

How to build an AI-powered document classifier

Quick answer

Build an AI-powered document classifier by embedding labeled example documents with OpenAIEmbeddings, indexing them in a vector store such as FAISS, and asking a gpt-4o chat model to assign the label. Semantic search retrieves the labeled example closest to each new document, and the model classifies the new document with that example as context.

PREREQUISITES

Python 3.8+
OpenAI API key with access to gpt-4o and an embedding model
pip install openai langchain langchain-community langchain-openai faiss-cpu

Setup environment

Install required Python packages and set your OpenAI API key as an environment variable.

bash

pip install openai langchain langchain-community langchain-openai faiss-cpu
export OPENAI_API_KEY="your-api-key"
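
Before running the classifier, it can help to fail fast when the key is missing instead of hitting a KeyError or an authentication error later. The following is a minimal sketch; `require_api_key` is a hypothetical helper name, not part of any library.

```python
import os


def require_api_key(name="OPENAI_API_KEY"):
    """Return the value of the environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running the classifier."
        )
    return value
```

Calling `require_api_key()` once at startup gives a readable message if the variable was never exported in the current shell session.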

Step by step document classifier

This example shows how to embed documents, build a vector index, and classify new documents by semantic similarity using gpt-4o.

python

import os
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate

# Sample labeled documents
documents = [
    {"text": "Invoice from Acme Corp for $500", "label": "Invoice"},
    {"text": "Meeting notes from project kickoff", "label": "Notes"},
    {"text": "Annual financial report 2025", "label": "Report"}
]

# Extract texts and labels
texts = [doc["text"] for doc in documents]
labels = [doc["label"] for doc in documents]

# Initialize embeddings and vector store, keeping each label as metadata
embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
metadatas = [{"label": label} for label in labels]
vector_store = FAISS.from_texts(texts, embeddings, metadatas=metadatas)

# Initialize chat model
chat = ChatOpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])

# Function to classify a new document

def classify_document(text):
    # Embed the text and retrieve the most similar labeled document
    docs = vector_store.similarity_search(text, k=1)
    if not docs:
        return "Unknown"
    # Use the closest document and its label as context
    example = docs[0].page_content
    example_label = docs[0].metadata["label"]
    prompt_template = """
You are a document classifier. The known categories are: {categories}

Here is a labeled example document:
{example}
Category: {example_label}

Classify the following document into one of the known categories:
{query}

Answer with only the category label.
"""
    prompt = ChatPromptTemplate.from_template(prompt_template)
    messages = prompt.format_messages(
        categories=", ".join(sorted(set(labels))),
        example=example,
        example_label=example_label,
        query=text,
    )
    response = chat.invoke(messages)
    return response.content.strip()

# Test classification
new_doc = "Summary of quarterly earnings and expenses"
label = classify_document(new_doc)
print(f"Document: {new_doc}\nClassified as: {label}")

output

Document: Summary of quarterly earnings and expenses
Classified as: Report
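
To make the retrieval step less opaque, here is a small offline sketch of nearest-neighbor labeling with a majority vote over the k closest examples. It is an illustration under assumptions, not what FAISS does internally: the three-dimensional vectors are hand-made stand-ins for real embeddings, cosine similarity is chosen for simplicity (a vector store may use a different distance), and `nearest_label` is a hypothetical helper.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_label(query_vec, examples, k=1):
    """examples is a list of (vector, label) pairs; return the majority label among the k most similar."""
    if not examples:
        return "Unknown"
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query_vec, ex[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    # On a tie, Counter keeps insertion order, so the most similar example wins
    return votes.most_common(1)[0][0]


# Hand-made vectors standing in for real embeddings (assumption for illustration)
toy_examples = [
    ([0.9, 0.1, 0.0], "Invoice"),
    ([0.1, 0.9, 0.0], "Notes"),
    ([0.0, 0.2, 0.9], "Report"),
]
print(nearest_label([0.1, 0.1, 0.8], toy_examples))
```

With these toy vectors the query lies closest to the "Report" example, so the script prints Report. Raising k only changes the result when several examples per category exist to vote.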

Common variations

Use claude-3-5-sonnet-20241022 for potentially better classification accuracy.
Implement async calls with asyncio for high throughput.
Use streaming responses for real-time classification feedback.
Swap FAISS with Chroma or other vector stores depending on scale and persistence needs.
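
The asyncio variation might look roughly like the sketch below, which classifies several documents concurrently while capping the number of in-flight calls. `classify_many` and `fake_classify` are hypothetical names; `fake_classify` is a stand-in for a real model call so the example runs without an API key.

```python
import asyncio


async def classify_many(texts, classify_one, max_concurrency=5):
    """Run classify_one over texts concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text):
        async with semaphore:
            return await classify_one(text)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(t) for t in texts))


async def fake_classify(text):
    """Hypothetical stand-in for a real model call, so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return "Invoice" if "invoice" in text.lower() else "Notes"


results = asyncio.run(classify_many(["Invoice #42", "Kickoff notes"], fake_classify))
print(results)
```

In a real classifier, `classify_one` would await the chat model's `ainvoke` method, the async counterpart of `invoke` on LangChain chat models. The concurrency cap is a guess to tune against your own rate limits.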

Troubleshooting tips

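If classification returns "Unknown", the similarity search found no neighbors, which usually means the index is empty; add labeled examples. If the returned labels are wrong rather than "Unknown", add more labeled examples per category or retrieve more neighbors by increasing k.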
Ensure your OpenAI API key is set correctly in os.environ["OPENAI_API_KEY"].
Check for rate limits or quota errors from the API and handle retries gracefully.
Validate that document texts are clean and representative of categories.
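
One way to handle retries gracefully is exponential backoff with a little jitter, sketched below. `call_with_retries` is a hypothetical helper, and the attempt count and delays are illustrative defaults rather than recommended values.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait with exponential backoff and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: 1s, 2s, 4s, ... plus a small random jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

A call such as `call_with_retries(lambda: classify_document(new_doc))` would wrap the classifier. Narrowing `retry_on` to the rate-limit exception of your installed openai package (likely `openai.RateLimitError` in version 1 and later, which is an assumption to verify) avoids retrying errors that will never succeed, such as an invalid API key.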


Key Takeaways

Use vector embeddings and semantic search to match new documents to labeled examples for classification.
Leverage gpt-4o or claude-3-5-sonnet-20241022 chat models to interpret and label documents contextually.
Start with a small labeled dataset and expand it to improve classification accuracy.
Choose vector stores like FAISS or Chroma based on your scale and persistence needs.
Handle API errors and environment setup carefully to ensure smooth classification workflows.

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022
