How to · Intermediate · 4 min read

Financial document extraction with AI

Quick answer
Use a PDF text-extraction or OCR tool to convert financial documents into plain text, then use an LLM such as gpt-4o to extract structured data from invoices, receipts, or balance sheets. This approach enables automated parsing, classification, and field extraction from PDFs or scanned images.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key
  • pip install openai>=1.0
  • pip install pdfplumber (PDF text extraction) or pytesseract (OCR; requires the Tesseract binary)

Setup

Install the necessary Python packages for OCR and OpenAI API access. Set your OpenAI API key as an environment variable for secure authentication.

bash
pip install openai pdfplumber pytesseract
output
Collecting openai
Collecting pdfplumber
Collecting pytesseract
Successfully installed openai pdfplumber pytesseract
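On macOS or Linux, the API key can be exported in the shell before running the scripts below (the key value here is a placeholder, not a real key):

```shell
export OPENAI_API_KEY="sk-your-key-here"  # placeholder value
```

On Windows, use `set OPENAI_API_KEY=...` in cmd or `$env:OPENAI_API_KEY = "..."` in PowerShell.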

Step by step

This example extracts text from a PDF financial document using pdfplumber and then uses gpt-4o to parse key financial fields like invoice number, date, and total amount.

python
import os
import pdfplumber
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Extract text from PDF
with pdfplumber.open("financial_invoice.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# Prompt to extract structured data
prompt = f"Extract invoice number, date, and total amount from the following text:\n\n{text}\n\nRespond in JSON format with keys: invoice_number, date, total_amount." 

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

print("Extracted data:", response.choices[0].message.content)
output
Extracted data: {
  "invoice_number": "INV-12345",
  "date": "2026-03-15",
  "total_amount": "$1,250.00"
}
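The model's reply arrives as plain text, and some models wrap JSON in markdown code fences. A small defensive parser (a sketch; `parse_model_json` is a hypothetical helper, not part of the OpenAI SDK) keeps the downstream code robust:

```python
import json

def parse_model_json(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating markdown code fences."""
    cleaned = reply.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (with its optional language tag) and the closing fence
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)

data = parse_model_json('```json\n{"invoice_number": "INV-12345"}\n```')
print(data["invoice_number"])  # INV-12345
```

Alternatively, passing `response_format={"type": "json_object"}` to `client.chat.completions.create` asks gpt-4o to return valid JSON directly, which removes the need for fence stripping.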

Common variations

You can use pytesseract for OCR on scanned images instead of extracting embedded PDF text. For asynchronous workflows, use AsyncOpenAI from the OpenAI SDK. Other models such as claude-3-5-haiku-20241022 (via the Anthropic SDK and an ANTHROPIC_API_KEY) or gemini-2.0-flash (via Google's SDK) can be substituted depending on accuracy and cost preferences.
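The asynchronous workflow mentioned above can be sketched with `asyncio.gather` fanning out one extraction per document. Here `extract_fields` is a hypothetical stand-in for a real `AsyncOpenAI().chat.completions.create` call, so the concurrency pattern runs without network access:

```python
import asyncio

async def extract_fields(doc_text: str) -> dict:
    # In a real pipeline, this would await an AsyncOpenAI chat completion
    # with the extraction prompt; here we just simulate the awaited call.
    await asyncio.sleep(0)
    return {"source_length": len(doc_text)}

async def main(docs):
    # Run one extraction per document concurrently; results keep input order.
    return await asyncio.gather(*(extract_fields(d) for d in docs))

results = asyncio.run(main(["invoice one", "receipt two"]))
print(results)  # [{'source_length': 11}, {'source_length': 11}]
```

Because `gather` preserves input order, results can be zipped back to their source filenames safely.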

python
import os
import pytesseract
from PIL import Image
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# OCR from image (requires the Tesseract binary to be installed)
image = Image.open("invoice_scan.png")
text = pytesseract.image_to_string(image)

prompt = f"Extract invoice number, date, and total amount from the text below in JSON format:\n\n{text}"

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
)

print("Extracted data:", response.content[0].text)
output
Extracted data: {
  "invoice_number": "INV-67890",
  "date": "2026-04-01",
  "total_amount": "$2,340.50"
}
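Extracted amounts come back as display strings like "$2,340.50". For downstream arithmetic, a small normalization step helps; this sketch uses Python's `Decimal` (the `normalize_amount` helper is illustrative, and assumes dollar-formatted input):

```python
from decimal import Decimal

def normalize_amount(raw: str) -> Decimal:
    """Convert a currency string like '$2,340.50' to a Decimal for arithmetic."""
    return Decimal(raw.replace("$", "").replace(",", ""))

print(normalize_amount("$2,340.50"))  # 2340.50
```

`Decimal` avoids the rounding surprises of binary floats, which matters for financial totals.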

Troubleshooting

  • If text extraction from PDFs is empty, verify the PDF is not scanned image-only; use OCR instead.
  • If the AI output is incomplete or incorrect, refine the prompt with examples or increase max_tokens.
  • Ensure your OpenAI API key is set correctly in the environment variable OPENAI_API_KEY.
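The first troubleshooting point can be automated with a simple heuristic: if pdfplumber returns little or no text, treat the PDF as image-only and fall back to OCR. The threshold below is an illustrative assumption, not a fixed rule:

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a near-empty extraction suggests a scanned, image-only PDF."""
    return len(extracted_text.strip()) < min_chars

print(needs_ocr(""))  # True
print(needs_ocr("Invoice INV-12345 dated 2026-03-15, total $1,250.00"))  # False
```

In a pipeline, a `True` result would route the document to the pytesseract path shown in the variations above.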

Key Takeaways

  • Combine text-extraction or OCR tools like pdfplumber and pytesseract with LLMs for accurate financial data extraction.
  • Use structured prompts requesting JSON output to simplify parsing of extracted data.
  • Choose models like gpt-4o or claude-3-5-haiku-20241022 based on your accuracy and cost needs.
  • Validate document type before extraction to select the right text or image processing method.
  • Set environment variables securely and handle API errors with prompt tuning or token limits.
Verified 2026-04 · gpt-4o, claude-3-5-haiku-20241022, gemini-2.0-flash