# How to extract tables from PDF with Python

## Direct answer

Use Python libraries like PyMuPDF or pdfplumber for local extraction, or AI APIs such as OpenAI with PDF parsing tools to extract tables from PDFs programmatically.

## Setup
Install:

```shell
pip install pdfplumber openai
```

Env vars: `OPENAI_API_KEY`

## Imports
```python
import os
import pdfplumber
from openai import OpenAI
```

## Examples
**In:** Extract tables from a simple one-page PDF with a single table.

**Out:** `[["Name", "Age"], ["Alice", "30"], ["Bob", "25"]]`

**In:** Extract tables from a multi-page PDF with multiple tables on different pages.

**Out:** `[[["Product", "Price"], ["Book", "$10"], ["Pen", "$2"]], [["City", "Population"], ["NYC", "8M"], ["LA", "4M"]]]`

**In:** Extract tables from a scanned PDF using AI OCR and table detection.

**Out:** `[["Date", "Event"], ["2026-04-01", "Conference"], ["2026-04-02", "Workshop"]]`
## Integration steps

1. Install the `pdfplumber` and `openai` Python packages.
2. Load the PDF file locally using `pdfplumber` to extract raw table data.
3. Optionally, send extracted text or images to an AI API like OpenAI for enhanced table parsing or OCR.
4. Parse the API response to structure the table data into Python lists or dataframes.
5. Use or save the extracted table data as needed.
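Step 4 can be sketched in plain Python: converting a pdfplumber-style table (a list of rows, with the first row as the header) into a list of dicts. The helper name `rows_to_records` is illustrative, not part of either library; note that pdfplumber represents empty cells as `None`.

```python
def rows_to_records(table):
    """Convert a header-first list-of-lists table into a list of dicts.

    pdfplumber uses None for empty cells, so normalize them to "" first.
    """
    header, *rows = [
        [cell if cell is not None else "" for cell in row] for row in table
    ]
    return [dict(zip(header, row)) for row in rows]

table = [["Name", "Age"], ["Alice", "30"], ["Bob", None]]
print(rows_to_records(table))
# → [{'Name': 'Alice', 'Age': '30'}, {'Name': 'Bob', 'Age': ''}]
```

The same list of dicts can be handed directly to `pandas.DataFrame(...)` if you prefer dataframes.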
## Full code

```python
import os
import pdfplumber
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

pdf_path = "sample_tables.pdf"

# Extract tables locally using pdfplumber
with pdfplumber.open(pdf_path) as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            all_tables.append(table)

print("Extracted tables locally:")
for i, table in enumerate(all_tables, 1):
    print(f"Table {i}:")
    for row in table:
        print(row)

# Example: Use OpenAI to parse a table text snippet (optional enhancement)
if all_tables:
    # Convert first table to text for AI parsing; pdfplumber returns None
    # for empty cells, so substitute "" to keep the join from failing
    table_text = "\n".join(
        "\t".join(cell or "" for cell in row) for row in all_tables[0] if row
    )
    prompt = f"Extract the table data as JSON from the following text:\n{table_text}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    ai_parsed = response.choices[0].message.content
    print("\nAI parsed table JSON:")
    print(ai_parsed)
```

## Output
```
Extracted tables locally:
Table 1:
['Name', 'Age']
['Alice', '30']
['Bob', '25']

AI parsed table JSON:
[
  {"Name": "Alice", "Age": 30},
  {"Name": "Bob", "Age": 25}
]
```

## API trace
Request:

```json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Extract the table data as JSON from the following text:\nName\tAge\nAlice\t30\nBob\t25"}]}
```

Response:

```json
{"choices": [{"message": {"content": "[ {\"Name\": \"Alice\", \"Age\": 30}, {\"Name\": \"Bob\", \"Age\": 25} ]"}}], "usage": {"total_tokens": 50}}
```

Extract: `response.choices[0].message.content`

## Variants
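The extracted `message.content` is a string, so you will usually parse it with `json.loads` before using it. Models sometimes wrap their JSON in a markdown fence, so a small tolerant helper is worth sketching — `parse_model_json` is our own illustrative name, not part of the OpenAI SDK:

```python
import json

def parse_model_json(content):
    """Parse JSON from a model reply, tolerating an optional ```json fence."""
    text = content.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional language tag)
        text = text.split("\n", 1)[1]
        # Drop the closing fence
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

reply = '```json\n[{"Name": "Alice", "Age": 30}]\n```'
print(parse_model_json(reply))
# → [{'Name': 'Alice', 'Age': 30}]
```

Bare JSON replies pass through unchanged, since the fence-stripping branch is skipped.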
### Streaming AI Table Parsing

Use streaming when parsing large tables or when you want to display partial results as they arrive for better UX.

```python
import os
import pdfplumber
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pdf_path = "sample_tables.pdf"

with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()

if tables:
    table_text = "\n".join(
        "\t".join(cell or "" for cell in row) for row in tables[0] if row
    )
    prompt = f"Extract the table data as JSON from the following text:\n{table_text}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    print("Streaming AI parsed table JSON:")
    for chunk in response:
        # In the v1 SDK, delta is an object (not a dict) and content may be None
        print(chunk.choices[0].delta.content or "", end="")
    print()
```

### Async Table Extraction with OpenAI
Use async when integrating table extraction into an asynchronous application or web server for concurrency.

```python
import os
import asyncio
import pdfplumber
from openai import AsyncOpenAI

# Use AsyncOpenAI for awaitable calls; the sync client has no `acreate`
# method in the v1 SDK
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def extract_table_async(pdf_path):
    # Note: pdfplumber is synchronous; offload it to a thread
    # (e.g. asyncio.to_thread) if it blocks your event loop
    with pdfplumber.open(pdf_path) as pdf:
        tables = pdf.pages[0].extract_tables()
    if not tables:
        return None
    table_text = "\n".join(
        "\t".join(cell or "" for cell in row) for row in tables[0] if row
    )
    prompt = f"Extract the table data as JSON from the following text:\n{table_text}"
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    result = await extract_table_async("sample_tables.pdf")
    print("Async AI parsed table JSON:")
    print(result)

asyncio.run(main())
```

### Using Google Vertex AI for Table Extraction
Use Google Vertex AI if you prefer Google Cloud services and want to leverage Gemini models for table parsing.

```python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

# Assume text extracted from PDF locally
table_text = "Name\tAge\nAlice\t30\nBob\t25"
prompt = f"Extract the table data as JSON from the following text:\n{table_text}"
response = model.generate_content(prompt)
print("Vertex AI parsed table JSON:")
print(response.text)
```

## Performance
- Latency: ~1-3 seconds per page for local extraction; ~1-2 seconds per API call for AI parsing
- Cost: ~$0.002 per 500 tokens for OpenAI `gpt-4o` calls
- Rate limits: OpenAI default tier: 350 RPM / 90K TPM

Tips:

- Extract only relevant table text before sending to AI to reduce tokens.
- Use concise prompts focused on table extraction.
- Cache repeated table extraction results to avoid redundant API calls.
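The caching tip can be sketched with only the standard library: key a dict on a hash of the table text so identical tables skip the API round trip. `cached_parse` and `call_model` are illustrative names; substitute your real API call for `call_model`.

```python
import hashlib

_cache = {}

def cached_parse(table_text, call_model):
    """Return a cached result for identical table text, calling the API at most once."""
    key = hashlib.sha256(table_text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(table_text)
    return _cache[key]

# Stand-in for a real API call, so we can count invocations
calls = []
def fake_model(text):
    calls.append(text)
    return f"parsed:{text}"

cached_parse("Name\tAge\nAlice\t30", fake_model)  # hits the "API"
cached_parse("Name\tAge\nAlice\t30", fake_model)  # served from cache
print(len(calls))
# → 1
```

For a long-running service you would bound the cache (e.g. `functools.lru_cache` on a wrapper, or an LRU dict) rather than letting it grow without limit.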
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Local pdfplumber extraction | ~1-3s per page | Free | Quick extraction of simple tables |
| OpenAI GPT-4o parsing | ~1-2s per call | ~$0.002 per 500 tokens | Complex table parsing and OCR enhancement |
| Google Vertex AI Gemini | ~1-2s per call | Check Google pricing | Cloud-native AI parsing with Gemini models |
## Quick tip

Use `pdfplumber` to extract raw tables locally before sending to an AI API for cleaner, more accurate table parsing.

## Common mistake

Beginners often try to parse tables directly from raw PDF text without using a PDF parsing library, resulting in messy or incomplete data.