How to extract structured data from a PDF with an LLM
Quick answer
Use a PDF parser such as PyPDF2 or pdfplumber to extract raw text from the PDF, then send that text to an LLM such as gpt-4o through the OpenAI SDK, with a prompt that lists the fields to extract. This converts unstructured PDF content into JSON or another structured format.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0" pdfplumber
Setup
Install the required Python packages for PDF parsing and OpenAI API access. Set your OpenAI API key as an environment variable.
pip install openai pdfplumber
Step by step
This example extracts text from a PDF using pdfplumber and sends it to OpenAI's gpt-4o model to extract structured JSON data such as names and dates.
import os
import pdfplumber
from openai import OpenAI

# Initialize the OpenAI client with the key from the environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Extract text from every page; pages without a text layer may yield None or ""
with pdfplumber.open("sample.pdf") as pdf:
    full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# Build the prompt for structured extraction
prompt = f"Extract the following structured data as JSON from the text below:\n- Name\n- Date of birth\n- Address\nText:\n{full_text}\n\nJSON:"

# Call the chat completions endpoint
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

# The reply is a string that should contain the JSON
structured_data = response.choices[0].message.content
print(structured_data)
Output
{
  "Name": "John Doe",
  "Date of birth": "1990-01-01",
  "Address": "123 Main St, Anytown, USA"
}
Common variations
- Use PyPDF2 instead of pdfplumber for PDF text extraction.
- Use other LLMs, such as claude-3-5-sonnet-20241022 with the Anthropic SDK, for extraction.
- For large PDFs, chunk the text and extract data in batches to avoid token limits.
- Use async calls with the OpenAI SDK for improved throughput.
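The chunking variation can be sketched as follows. This is a minimal illustration rather than a tuned implementation: the helper name chunk_text and its max_chars and overlap parameters are hypothetical, and character counts are used only as a rough stand-in for token counts, which depend on the model's tokenizer.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> List[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper: the default sizes are assumptions, not model limits.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be at least 0 and less than max_chars")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so a field split at a boundary appears whole in one chunk
        start = end - overlap
    return chunks
```

Each chunk would then be sent with the same extraction prompt. How the per-chunk results are merged (for example, keeping the first non-empty value for each field) is left to the application.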
Troubleshooting
- If the extracted text is empty, check whether the PDF consists of scanned images; if it does, run it through an OCR tool first.
- If the LLM output is not valid JSON, add explicit instructions in the prompt to respond only with JSON.
- Check your OPENAI_API_KEY environment variable if authentication errors occur.
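For the invalid-JSON case, one option besides tightening the prompt is to parse the reply defensively. The sketch below assumes the most common failure is a reply wrapped in a Markdown code fence; parse_json_reply is a hypothetical helper name, not part of the OpenAI SDK.

```python
import json
import re

# Matches a reply wrapped in a Markdown fence (three backticks), optionally tagged "json"
_FENCED = re.compile(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", re.DOTALL)


def parse_json_reply(reply: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence.

    Hypothetical helper. Raises json.JSONDecodeError (a ValueError) on invalid
    JSON and ValueError when the JSON is not an object.
    """
    text = reply.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

In the example above, this would be called as parse_json_reply(response.choices[0].message.content). The Chat Completions API also accepts response_format={"type": "json_object"} on models that support JSON mode, which may make the fence handling unnecessary; check the current API reference before relying on it.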
Key Takeaways
- Use pdfplumber or a similar library to extract raw text from PDFs before sending it to an LLM.
- Prompt LLMs like gpt-4o explicitly to output structured JSON for reliable data extraction.
- Chunk large documents to stay within token limits and improve extraction accuracy.
- Validate and sanitize LLM output to ensure it matches expected structured formats.
- Set API keys securely via environment variables to avoid credential leaks.
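The validation takeaway can be illustrated with a small check on the parsed record. This is a sketch under assumptions: the field names come from the example prompt above, the year-month-day date format is inferred from the sample output, and validate_record is a hypothetical helper name.

```python
from datetime import datetime
from typing import Any, Dict, List

# Field names taken from the example prompt; adjust to your own schema
REQUIRED_FIELDS = ("Name", "Date of birth", "Address")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the record; an empty list means it passed."""
    problems: List[str] = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    # Assumption: dates are expected in year-month-day form, as in the sample output
    dob = record.get("Date of birth")
    if isinstance(dob, str) and dob.strip():
        try:
            datetime.strptime(dob.strip(), "%Y-%m-%d")
        except ValueError:
            problems.append(f"date of birth does not parse as year-month-day: {dob}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue for a document, or to feed the issues back to the model in a retry prompt.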