
How to use AI to extract information from invoices

Quick answer
Use an OCR tool to convert invoice images or PDFs into text, then pass that text to a large language model such as gpt-4o to extract structured fields such as invoice number, date, vendor, and total. OCR handles text recognition; the LLM handles interpretation and structuring.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (the account needs access to gpt-4o)
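  • pip install "openai>=1.0" pytesseract pdf2image Pillow
  • Tesseract OCR and Poppler installed on the system (pytesseract and pdf2image are wrappers around them)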

Setup

Install necessary Python packages for OCR and OpenAI API interaction. Set your OPENAI_API_KEY as an environment variable.

bash
pip install "openai>=1.0" pytesseract pdf2image Pillow
export OPENAI_API_KEY="your-api-key"  # placeholder; use your own key

Step by step

This example converts a PDF invoice to text using OCR, then sends the text to gpt-4o to extract key invoice fields in JSON format.

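python
import json
import os

import pytesseract
from openai import OpenAI
from pdf2image import convert_from_path

# Convert each PDF page to a PIL image (pdf2image requires Poppler)
images = convert_from_path('invoice.pdf')

# OCR every page (pytesseract requires the Tesseract binary) and join the text
invoice_text = '\n'.join(pytesseract.image_to_string(img) for img in images)

# Initialize OpenAI client
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

# Prompt to extract structured invoice data
prompt = f"""
Extract the following fields from the invoice text below.
Return a JSON object with exactly these keys, using null for any field not found:
- Invoice Number
- Invoice Date
- Vendor Name
- Total Amount

Invoice Text:
{invoice_text}
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    # JSON mode: the reply is a single JSON object without Markdown wrapping
    response_format={"type": "json_object"},
    temperature=0,
)

# Parse the reply so that malformed output fails here rather than downstream
extracted_data = json.loads(response.choices[0].message.content)
print(json.dumps(extracted_data, indent=2))

output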
{
  "Invoice Number": "INV-12345",
  "Invoice Date": "2026-03-15",
  "Vendor Name": "Acme Supplies",
  "Total Amount": "$1,234.56"
}
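
The extracted values are all strings, so downstream code usually needs typed values and a check that no field is missing. The following is a minimal sketch, assuming the key names listed in the prompt, an ISO-formatted date, and a US-style amount. `normalize_invoice` is a hypothetical helper written for this example, not part of any library.

```python
import json
from datetime import date
from decimal import Decimal
from typing import Union

REQUIRED_KEYS = ("Invoice Number", "Invoice Date", "Vendor Name", "Total Amount")


def normalize_invoice(reply: Union[str, dict]) -> dict:
    """Convert the model's reply (raw JSON string or parsed dict) to typed fields."""
    data = json.loads(reply) if isinstance(reply, str) else dict(reply)

    # Fail early if the model omitted a field or returned an empty value
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"Missing fields: {', '.join(missing)}")

    # Assumption: a US-style amount such as "$1,234.56"
    amount_text = str(data["Total Amount"]).replace("$", "").replace(",", "").strip()

    return {
        "invoice_number": str(data["Invoice Number"]).strip(),
        # Assumption: an ISO 8601 date (YYYY-MM-DD)
        "invoice_date": date.fromisoformat(str(data["Invoice Date"]).strip()),
        "vendor_name": str(data["Vendor Name"]).strip(),
        "total_amount": Decimal(amount_text),
    }


sample = (
    '{"Invoice Number": "INV-12345", "Invoice Date": "2026-03-15", '
    '"Vendor Name": "Acme Supplies", "Total Amount": "$1,234.56"}'
)
invoice = normalize_invoice(sample)
print(invoice["invoice_number"], invoice["invoice_date"], invoice["total_amount"])
```

Invoices with other date or currency formats would need their own parsing rules; it may be simpler to ask the model for ISO dates and plain decimal amounts in the prompt.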

Common variations

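  • Try claude-3-5-sonnet-20241022 (through Anthropic's API) and compare its accuracy with gpt-4o on complex invoices.
  • If invoices are scanned photos (JPEG or PNG), skip pdf2image and pass the image to pytesseract directly, or send the image to a vision-capable model such as gpt-4o.
  • For batch processing, use the OpenAI SDK's AsyncOpenAI client to send requests concurrently.
  • Combine a template-based parser such as invoice2data with an LLM for hybrid extraction: templates for known vendors, the LLM for everything else.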
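
The batch-processing variation can be sketched as follows. This is an illustrative outline rather than a tuned implementation: `extract_many`, `EXTRACTION_PROMPT`, and the concurrency limit of 5 are assumptions made for the example, and the placeholder strings stand in for real OCR output.

```python
import asyncio

# Hypothetical prompt prefix; adapt the field list to your invoices
EXTRACTION_PROMPT = (
    "Extract Invoice Number, Invoice Date, Vendor Name, and Total Amount "
    "from the invoice text below as JSON.\n\nInvoice Text:\n"
)


async def extract_many(client, invoice_texts, model="gpt-4o", max_concurrency=5):
    """Send one extraction request per invoice, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def extract_one(text):
        async with semaphore:
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": EXTRACTION_PROMPT + text}],
            )
            return response.choices[0].message.content

    # gather returns results in the same order as invoice_texts
    return await asyncio.gather(*(extract_one(text) for text in invoice_texts))


async def main():
    # Imported here so the helper above can be exercised without the SDK installed
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    texts = ["<OCR text of invoice 1>", "<OCR text of invoice 2>"]  # placeholders
    for result in await extract_many(client, texts):
        print(result)


# To run against the API: asyncio.run(main())
```

The semaphore caps the number of in-flight requests, which helps keep a large batch under the account's rate limits.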

Troubleshooting

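  • If OCR text is garbled, improve image quality (higher resolution, better contrast) or adjust the pytesseract config, for example the page segmentation mode with config='--psm 6'.
  • If the extracted JSON is incomplete, make the prompt more explicit: name each key and say what to return when a field is absent (for example, null).
  • If requests fail, check that OPENAI_API_KEY is set correctly and that the account has not hit its rate or usage limits.
  • For multi-language invoices, pass the language codes to pytesseract (for example, lang='eng+deu', with the matching Tesseract language data installed) and name the language in the prompt.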
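
The OCR-related tips above can be combined into one helper. This is a sketch under stated assumptions: `build_tesseract_config` and `ocr_image` are hypothetical names, the default page segmentation mode 6 (a single uniform block of text) is only a starting point, and grayscale conversion may or may not help on a given scan.

```python
def build_tesseract_config(psm=6, oem=3):
    """Build the option string passed to Tesseract (page segmentation and engine mode)."""
    return f"--oem {oem} --psm {psm}"


def ocr_image(path, languages=("eng",), psm=6):
    """OCR a single image file (for example, a scanned photo) with explicit settings."""
    # Imported lazily so this module loads even where the OCR stack is not installed
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        grayscale = image.convert("L")  # grayscale may help on colored or noisy scans
        return pytesseract.image_to_string(
            grayscale,
            lang="+".join(languages),  # e.g. "eng+deu"; each language pack must be installed
            config=build_tesseract_config(psm),
        )


# Example call (hypothetical file name):
# text = ocr_image("invoice_photo.png", languages=("eng", "deu"), psm=4)
```

Trying a few `psm` values on a representative sample is usually quicker than guessing: mode 4 treats the page as a single column of variable-sized text, and mode 6 treats it as one uniform block.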

Key Takeaways

  • Combine OCR tools with LLMs like gpt-4o for effective invoice data extraction.
  • Use clear, structured prompts to guide the model to output JSON with required fields.
  • Improve accuracy by preprocessing images and refining prompts based on invoice format.
  • Evaluate alternative models such as claude-3-5-sonnet-20241022 on complex documents and keep whichever is more accurate on your invoices.
  • Handle errors by validating OCR output and monitoring API usage.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022