
How to use AI to extract data from PDFs

Quick answer

Extract data from PDFs with AI in two steps: convert the PDF content to text with a document loader such as PyPDFLoader, then pass that text to an LLM such as gpt-4o to parse and structure it. PyPDFLoader reads the text already embedded in the PDF, so scanned, image-only PDFs need an OCR step first.

PREREQUISITES

Python 3.8+
OpenAI API key with access to gpt-4o
pip install "openai>=1.0" langchain langchain-community pypdf

Setup

Install the required Python packages and set your OpenAI API key as the OPENAI_API_KEY environment variable.

bash

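# pypdf is required by PyPDFLoader; the quotes keep the shell from treating ">" as a redirect
pip install "openai>=1.0" langchain langchain-community pypdf
export OPENAI_API_KEY="your-api-key"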

Step by step

This example loads a PDF, extracts text, and uses gpt-4o to extract structured data like invoice numbers or dates.

python

import os
from openai import OpenAI
from langchain_community.document_loaders import PyPDFLoader

# Load PDF and extract text
loader = PyPDFLoader("sample_invoice.pdf")
docs = loader.load()
text = "\n".join([doc.page_content for doc in docs])

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Prompt asking for the structured fields as JSON
prompt = (
    "Extract invoice number, date, and total amount from the following text:\n\n"
    f"{text}\n\n"
    "Respond in JSON format with keys: invoice_number, date, total_amount."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    # JSON mode: asks the API to return a single valid JSON object
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)

output

{
  "invoice_number": "INV-12345",
  "date": "2026-03-15",
  "total_amount": "$1,234.56"
}
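
The model's reply arrives as a string, so downstream code usually has to parse and check it before use. The following is a minimal sketch, assuming the reply contains one JSON object with the three keys requested in the prompt. The helper name parse_invoice_reply and the REQUIRED_KEYS constant are illustrative and not part of any library.

```python
import json

# Keys requested in the prompt; adjust if the prompt changes.
REQUIRED_KEYS = ("invoice_number", "date", "total_amount")


def parse_invoice_reply(reply: str) -> dict:
    """Parse the model's reply into a dict and check the expected keys.

    Tolerates extra text around the JSON object (for example a code fence
    or a short preamble) by slicing from the first "{" to the last "}".
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the reply")

    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    data = json.loads(reply[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("Reply is not a JSON object")

    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Reply is missing keys: {missing}")
    return data
```

In the example above, this would be called as parse_invoice_reply(response.choices[0].message.content). Failing loudly on missing keys makes it easier to spot documents where the model could not find a field.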

Common variations

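For scanned PDFs, run OCR (for example with pytesseract) on the page images first; by default PyPDFLoader only reads text already embedded in the PDF.
Try another model, such as claude-3-5-sonnet-20241022 via the Anthropic SDK, and compare extraction accuracy on your own documents.
Use async API calls to process many documents or chunks concurrently and improve throughput.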
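
The async variation can be sketched as a small helper that runs one extraction coroutine per text while capping how many run at once. This is a sketch under assumptions: extract_all, extract_one, max_concurrency, and PROMPT are hypothetical names, and the commented-out OpenAI call only shows where a real request would go.

```python
import asyncio
from typing import Awaitable, Callable, List


async def extract_all(
    texts: List[str],
    extract_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> List[str]:
    """Run extract_one over every text, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> str:
        # The semaphore limits in-flight requests to help stay under rate limits.
        async with semaphore:
            return await extract_one(text)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(text) for text in texts))


# Hypothetical extract_one using the OpenAI async client (not run here):
#
#     from openai import AsyncOpenAI
#     client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
#
#     async def extract_one(text: str) -> str:
#         response = await client.chat.completions.create(
#             model="gpt-4o",
#             messages=[{"role": "user", "content": PROMPT + text}],
#         )
#         return response.choices[0].message.content
```

A call such as asyncio.run(extract_all(page_texts, extract_one)) would then process all texts concurrently. The right concurrency limit depends on your account's rate limits.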

Troubleshooting

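If text extraction returns empty text, check whether the PDF is a scanned, image-only file; if it is, run OCR preprocessing first.
If the AI output is incomplete, increase max_tokens, simplify the prompt, or split long documents into smaller chunks.
If authentication errors occur, check that the API key is valid and that the OPENAI_API_KEY environment variable is set.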
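
Two of these checks are easy to automate before any API call is made. The sketch below assumes a list of per-page strings such as [doc.page_content for doc in docs]. The function names and the 20-character threshold are illustrative choices, not library defaults.

```python
import os
from typing import List


def find_empty_pages(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is suspiciously short.

    Pages with almost no text are often scanned images that need OCR.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def api_key_is_set(name: str = "OPENAI_API_KEY") -> bool:
    """Return True if the environment variable exists and is not blank."""
    return bool(os.environ.get(name, "").strip())
```

If find_empty_pages returns every page of a document, the file is likely image-only and should go through OCR before being sent to the model.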


Key Takeaways

Combine PDF text extraction with LLMs for accurate data parsing.
Use document loaders like PyPDFLoader to handle PDF content efficiently.
Choose a capable model such as gpt-4o or claude-3-5-sonnet-20241022, and validate extraction results on your own documents.
Preprocess scanned PDFs with OCR before AI processing.
Adjust prompts and token limits to optimize extraction quality.

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022
