How to extract structured data from PDF with LLM

Quick answer

Extract the raw text with a PDF parser such as pdfplumber or pypdf (the maintained successor to PyPDF2), then send that text to an LLM such as gpt-4o through the OpenAI SDK with a prompt that names the fields to extract. The model returns the unstructured PDF content as JSON or another structured format.

Prerequisites

Python 3.8+
OpenAI API key (gpt-4o access may require a funded account; check your usage tier)
pip install "openai>=1.0" pdfplumber

Setup

Install the required Python packages for PDF parsing and OpenAI API access. Set your OpenAI API key as an environment variable.

bash

pip install "openai>=1.0" pdfplumber
export OPENAI_API_KEY="your-api-key"  # placeholder value; use your own key

Step by step

This example extracts text from a PDF with pdfplumber and sends it to OpenAI's gpt-4o model, which returns the requested fields (name, date of birth, address) as JSON.

python

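import json
import os

import pdfplumber
from openai import OpenAI

# Initialize the OpenAI client with the key from the environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Extract text from every page; extract_text() can return None for a page
# with no text layer, and the newline keeps adjacent pages from fusing
with pdfplumber.open("sample.pdf") as pdf:
    full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# Prepare the prompt for structured extraction
prompt = (
    "Extract the following fields from the text below and respond with a JSON object only:\n"
    "- Name\n- Date of birth\n- Address\n\n"
    f"Text:\n{full_text}"
)

# Call the chat completions API; JSON mode constrains the reply to valid JSON
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

# Parse the JSON string returned by the model and pretty-print it
structured_data = json.loads(response.choices[0].message.content)
print(json.dumps(structured_data, indent=2))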

output

{
  "Name": "John Doe",
  "Date of birth": "1990-01-01",
  "Address": "123 Main St, Anytown, USA"
}

Common variations

Use pypdf (the maintained successor to PyPDF2) instead of pdfplumber for PDF text extraction.
Use other LLMs like claude-3-5-sonnet-20241022 with the Anthropic SDK for extraction.
For large PDFs, chunk text and extract data in batches to avoid token limits.
Use async calls with the OpenAI SDK's AsyncOpenAI client for improved throughput.
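
The chunking and async variations can be combined. The following is a minimal sketch, not a definitive implementation: `chunk_text` and `extract_all` are hypothetical helper names, the character limits are arbitrary stand-ins for a real token budget, and `extract_one` is assumed to be a coroutine you write yourself (for example, one that wraps a call made with the SDK's `AsyncOpenAI` client).

```python
import asyncio
from typing import Awaitable, Callable, List


def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> List[str]:
    """Split text into overlapping character windows.

    Character counts are only a rough proxy for tokens (assumption);
    the overlap reduces the chance of cutting a record in half.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window reached the end of the text
        start += max_chars - overlap
    return chunks


async def extract_all(
    chunks: List[str],
    extract_one: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 4,
) -> List[dict]:
    """Run extract_one on every chunk, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: str) -> dict:
        async with semaphore:
            return await extract_one(chunk)

    # asyncio.gather returns results in the same order as the input chunks
    return await asyncio.gather(*(guarded(chunk) for chunk in chunks))
```

Capping concurrency with a semaphore is one way to stay under API rate limits; the results still need to be merged or de-duplicated, because overlapping chunks can yield the same record twice.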

Troubleshooting

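If the extracted text is empty, the PDF probably consists of scanned images with no text layer; run it through an OCR tool (for example, Tesseract) first.
If the LLM output is not valid JSON, tell the model explicitly to respond only with JSON, or enable JSON mode with response_format={"type": "json_object"}.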
Check your OPENAI_API_KEY environment variable if authentication errors occur.
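
The first two checks above can be automated. This is a rough sketch under stated assumptions: `pages_without_text` and `parse_json_reply` are hypothetical helpers, not part of pdfplumber or the OpenAI SDK, and the parser assumes the reply contains a single JSON object, possibly wrapped in a code fence or a sentence of commentary.

```python
import json
from typing import List, Optional


def pages_without_text(page_texts: List[Optional[str]]) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is empty.

    A PDF where every page is listed is likely a scan that needs OCR.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def parse_json_reply(reply: str) -> dict:
    """Parse one JSON object from a model reply, ignoring fences or prose around it."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    # json.JSONDecodeError is a subclass of ValueError, so callers
    # can catch ValueError for both failure modes
    return json.loads(reply[start:end + 1])
```

With pdfplumber, the page texts would come from `[page.extract_text() for page in pdf.pages]`.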

Key Takeaways

Use pdfplumber or similar libraries to extract raw text from PDFs before sending to LLMs.
Prompt LLMs like gpt-4o explicitly to output structured JSON for reliable data extraction.
Chunk large documents to stay within token limits and improve extraction accuracy.
Validate and sanitize LLM output to ensure it matches expected structured formats.
Set API keys securely via environment variables to avoid credential leaks.
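
As an illustration of validating and sanitizing the output, here is a minimal sketch. `REQUIRED_FIELDS` and `validate_record` are hypothetical names, the field list mirrors the example prompt, and the ISO date format (YYYY-MM-DD) is an assumption based on the sample output rather than something the model guarantees.

```python
from datetime import date

# Field names taken from the example prompt (assumption: all are required)
REQUIRED_FIELDS = ("Name", "Date of birth", "Address")


def validate_record(record: dict) -> dict:
    """Return a cleaned copy of record, or raise ValueError if it is malformed."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    cleaned = {}
    for field in REQUIRED_FIELDS:
        value = record[field]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field!r} must be a non-empty string")
        cleaned[field] = value.strip()

    # Raises ValueError if the date is not in ISO format
    date.fromisoformat(cleaned["Date of birth"])

    # Keys outside REQUIRED_FIELDS are dropped from the result
    return cleaned
```

Rejecting a bad record outright keeps malformed data out of downstream systems; a retry with a stricter prompt is another reasonable response.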

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022
