
How to use AI to extract information from invoices

Quick answer

Use an OCR tool such as Tesseract to convert invoice images or PDFs into text, then have a large language model such as gpt-4o parse that text into structured fields: invoice number, date, vendor, and totals. OCR handles the text extraction; the LLM handles semantic understanding and data structuring.

Prerequisites

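Python 3.8+
OpenAI API key with access to gpt-4o
Tesseract OCR and Poppler installed on the system (pytesseract and pdf2image are wrappers around these binaries)
pip install "openai>=1.0" pytesseract pdf2image Pillow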

Setup

Install necessary Python packages for OCR and OpenAI API interaction. Set your OPENAI_API_KEY as an environment variable.

bash

pip install "openai>=1.0" pytesseract pdf2image Pillow
export OPENAI_API_KEY="your-api-key"
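
Before running the main example, it can help to confirm that the external pieces are actually in place. The following is a minimal sketch, not part of the original setup: the helper name `missing_prerequisites` is hypothetical, and it assumes that pytesseract needs the `tesseract` executable and pdf2image needs Poppler's `pdftoppm` on the PATH.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of names for anything the example needs but cannot find."""
    env = os.environ if env is None else env
    missing = []
    # pytesseract shells out to the `tesseract` executable
    if which("tesseract") is None:
        missing.append("tesseract")
    # pdf2image shells out to Poppler's `pdftoppm`
    if which("pdftoppm") is None:
        missing.append("poppler (pdftoppm)")
    # The OpenAI client reads its key from this environment variable
    if not env.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("Missing: " + ", ".join(problems) if problems else "All prerequisites found.")
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment; called with no arguments it inspects the current machine.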

Step by step

This example converts a PDF invoice to text using OCR, then sends the text to gpt-4o to extract key invoice fields in JSON format.

python

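import json
import os

import pytesseract
from openai import OpenAI
from pdf2image import convert_from_path

# Convert each page of the PDF invoice to a PIL image (requires Poppler)
images = convert_from_path('invoice.pdf')

# Run OCR on every page and join the results, keeping page boundaries
invoice_text = '\n'.join(pytesseract.image_to_string(img) for img in images)

# Initialize the OpenAI client with the key from the environment
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

# Prompt to extract structured invoice data
prompt = f"""
Extract the following fields from the invoice text below as a JSON object.
Use null for any field that is not present in the text.
- Invoice Number
- Invoice Date
- Vendor Name
- Total Amount

Invoice Text:
{invoice_text}
"""

# JSON mode asks the model to reply with a single valid JSON object
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0,
)

# Parse the reply so downstream code works with a dict, not a string
extracted_data = json.loads(response.choices[0].message.content)
print(json.dumps(extracted_data, indent=2))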

output

{
  "Invoice Number": "INV-12345",
  "Invoice Date": "2026-03-15",
  "Vendor Name": "Acme Supplies",
  "Total Amount": "$1,234.56"
}
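
The fields in this output are all strings, so a downstream system usually needs them checked and converted before use. The sketch below shows one way to do that; `normalize_invoice` and `REQUIRED_FIELDS` are hypothetical names, and it assumes ISO dates (YYYY-MM-DD) and US-style amounts (comma as thousands separator, dot as decimal point), as in the sample output above. Other invoice formats would need different parsing.

```python
import json
from datetime import date
from decimal import Decimal

REQUIRED_FIELDS = ("Invoice Number", "Invoice Date", "Vendor Name", "Total Amount")


def normalize_invoice(raw_json):
    """Parse the model's JSON reply and convert fields to typed values."""
    data = json.loads(raw_json)
    # Treat absent, null, and empty values as missing
    missing = [field for field in REQUIRED_FIELDS if not data.get(field)]
    if missing:
        raise ValueError(f"Missing fields: {', '.join(missing)}")
    # Keep digits, the decimal point, and a minus sign; drop "$" and ","
    amount_text = "".join(
        ch for ch in str(data["Total Amount"]) if ch.isdigit() or ch in ".-"
    )
    return {
        "invoice_number": str(data["Invoice Number"]),
        "invoice_date": date.fromisoformat(data["Invoice Date"]),
        "vendor_name": str(data["Vendor Name"]),
        "total_amount": Decimal(amount_text),
    }
```

Passing the model's raw reply to `normalize_invoice` either returns typed values or raises `ValueError`, which gives a clear signal for when to re-prompt or route the invoice to manual review.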

Common variations

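Try claude-3-5-sonnet-20241022 through the Anthropic SDK and compare its accuracy against gpt-4o on your more complex invoices.
If invoices arrive as scanned photos (PNG, JPEG), skip pdf2image and pass the image, opened with Pillow, straight to pytesseract.image_to_string.
For batch processing, use the OpenAI SDK's AsyncOpenAI client to issue requests concurrently.
Use a template-based invoice parser such as invoice2data for known layouts and fall back to an LLM for the rest.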
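
For the batch-processing variation, the concurrency pattern can be sketched independently of the API call itself. In the example below, `extract_many`, `fake_extract`, and `max_concurrency` are hypothetical names; `fake_extract` is a stand-in so the sketch runs offline. In real use, the callable passed as `extract_one` would be an async function that awaits `chat.completions.create` on an `AsyncOpenAI` client and returns the parsed JSON.

```python
import asyncio


async def extract_many(texts, extract_one, max_concurrency=5):
    """Run extract_one over every text, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(text):
        # The semaphore caps how many requests are in flight at once
        async with semaphore:
            return await extract_one(text)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(text) for text in texts))


async def fake_extract(text):
    """Stand-in for the real API call so the sketch runs offline."""
    await asyncio.sleep(0)
    return {"chars": len(text)}


if __name__ == "__main__":
    print(asyncio.run(extract_many(["invoice A", "invoice BB"], fake_extract)))
```

Capping concurrency with a semaphore is one simple way to stay under rate limits; a suitable value for `max_concurrency` depends on your account's limits and is an assumption here.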

Troubleshooting

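If OCR text is garbled, improve image quality (for example, raise the dpi argument of convert_from_path) or adjust the pytesseract config, such as the page segmentation mode (--psm).
If the extracted JSON is incomplete, make the prompt more explicit: name each field exactly and say what to return when a field is missing.
If requests fail, check that the API key is valid and that you have not hit rate or usage limits.
For multi-language invoices, pass the language to OCR (the lang argument of pytesseract.image_to_string, with the matching Tesseract language data installed) and state the language in the prompt.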
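
When requests fail intermittently because of rate limits or network errors, retrying with exponential backoff is a common remedy. The helper below is a generic sketch rather than anything from the OpenAI SDK: `call_with_retries` and its parameters are hypothetical. With the SDK you might pass `retry_on=(openai.RateLimitError, openai.APIConnectionError)`; an invalid API key should not be retried, since waiting will not fix it.

```python
import time


def call_with_retries(fn, retry_on=(Exception,), attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

To use it, wrap the request in a zero-argument callable, for example `call_with_retries(lambda: client.chat.completions.create(...), retry_on=(openai.RateLimitError,))`. The `sleep` parameter is injectable only so the waiting can be replaced in tests.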

Key Takeaways

Combine OCR tools with LLMs like gpt-4o for effective invoice data extraction.
Use clear, structured prompts to guide the model to output JSON with required fields.
Improve accuracy by preprocessing images and refining prompts based on invoice format.
Consider alternative models like claude-3-5-sonnet-20241022 for complex documents.
Handle errors by validating OCR output and monitoring API usage.

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022
