# How to extract tables from PDF with Python

## Direct answer

Use Python libraries like PyMuPDF or pdfplumber for local extraction, or AI APIs such as OpenAI with PDF parsing tools to extract tables from PDFs programmatically.

## Setup
Install:

```shell
pip install pdfplumber openai
```

Env vars: `OPENAI_API_KEY`

## Imports
```python
import os
import pdfplumber
from openai import OpenAI
```

## Examples
**In:** Extract tables from a simple one-page PDF with a single table.

**Out:** `[["Name", "Age"], ["Alice", "30"], ["Bob", "25"]]`

**In:** Extract tables from a multi-page PDF with multiple tables on different pages.

**Out:** `[[["Product", "Price"], ["Book", "$10"], ["Pen", "$2"]], [["City", "Population"], ["NYC", "8M"], ["LA", "4M"]]]`

**In:** Extract tables from a scanned PDF using AI OCR and table detection.

**Out:** `[["Date", "Event"], ["2026-04-01", "Conference"], ["2026-04-02", "Workshop"]]`
## Integration steps

1. Install the `pdfplumber` and `openai` Python packages.
2. Load the PDF file locally using `pdfplumber` to extract raw table data.
3. Optionally, send extracted text or images to an AI API like OpenAI for enhanced table parsing or OCR.
4. Parse the API response to structure the table data into Python lists or dataframes.
5. Use or save the extracted table data as needed.
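Step 4 can be sketched in plain Python: converting a pdfplumber-style table (a list of rows, with the first row as the header) into a list of dicts. The helper name `rows_to_records` is illustrative, not part of either library; note that pdfplumber represents empty cells as `None`.

```python
def rows_to_records(table):
    """Convert a header-first list-of-lists table into a list of dicts.

    pdfplumber uses None for empty cells, so normalize them to "" first.
    """
    header, *rows = [
        [cell if cell is not None else "" for cell in row] for row in table
    ]
    return [dict(zip(header, row)) for row in rows]

table = [["Name", "Age"], ["Alice", "30"], ["Bob", None]]
print(rows_to_records(table))
# → [{'Name': 'Alice', 'Age': '30'}, {'Name': 'Bob', 'Age': ''}]
```

The same list of dicts can be handed directly to `pandas.DataFrame(...)` if you prefer dataframes.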
## Full code

```python
import os
import pdfplumber
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

pdf_path = "sample_tables.pdf"

# Extract tables locally using pdfplumber
with pdfplumber.open(pdf_path) as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            all_tables.append(table)

print("Extracted tables locally:")
for i, table in enumerate(all_tables, 1):
    print(f"Table {i}:")
    for row in table:
        print(row)

# Example: Use OpenAI to parse a table text snippet (optional enhancement)
if all_tables:
    # Convert first table to text for AI parsing; pdfplumber returns None
    # for empty cells, so substitute "" to keep the join from failing
    table_text = "\n".join(
        "\t".join(cell or "" for cell in row) for row in all_tables[0] if row
    )
    prompt = f"Extract the table data as JSON from the following text:\n{table_text}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    ai_parsed = response.choices[0].message.content
    print("\nAI parsed table JSON:")
    print(ai_parsed)
```

## Output
```
Extracted tables locally:
Table 1:
['Name', 'Age']
['Alice', '30']
['Bob', '25']

AI parsed table JSON:
[
  {"Name": "Alice", "Age": 30},
  {"Name": "Bob", "Age": 25}
]
```

## API trace
Request:

```json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Extract the table data as JSON from the following text:\nName\tAge\nAlice\t30\nBob\t25"}]}
```

Response:

```json
{"choices": [{"message": {"content": "[ {\"Name\": \"Alice\", \"Age\": 30}, {\"Name\": \"Bob\", \"Age\": 25} ]"}}], "usage": {"total_tokens": 50}}
```

Extract: `response.choices[0].message.content`

## Variants
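The extracted `message.content` is a string, so you will usually parse it with `json.loads` before using it. Models sometimes wrap their JSON in a markdown fence, so a small tolerant helper is worth sketching — `parse_model_json` is our own illustrative name, not part of the OpenAI SDK:

```python
import json

def parse_model_json(content):
    """Parse JSON from a model reply, tolerating an optional ```json fence."""
    text = content.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional language tag)
        text = text.split("\n", 1)[1]
        # Drop the closing fence
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

reply = '```json\n[{"Name": "Alice", "Age": 30}]\n```'
print(parse_model_json(reply))
# → [{'Name': 'Alice', 'Age': 30}]
```

Bare JSON replies pass through unchanged, since the fence-stripping branch is skipped.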
### Streaming AI Table Parsing

Use streaming when parsing large tables or when you want to display partial results as they arrive for better UX.

```python
import os
import pdfplumber
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pdf_path = "sample_tables.pdf"

with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()

if tables:
    table_text = "\n".join(
        "\t".join(cell or "" for cell in row) for row in tables[0] if row
    )
    prompt = f"Extract the table data as JSON from the following text:\n{table_text}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    print("Streaming AI parsed table JSON:")
    for chunk in response:
        # In the v1 SDK, delta is an object (not a dict) and content may be None
        print(chunk.choices[0].delta.content or "", end="")
    print()
```

### Async Table Extraction with OpenAI
Use async when integrating table extraction into an asynchronous application or web server for concurrency.

```python
import os
import asyncio
import pdfplumber
from openai import AsyncOpenAI

# Use AsyncOpenAI for awaitable calls; the sync client has no `acreate`
# method in the v1 SDK
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def extract_table_async(pdf_path):
    # Note: pdfplumber is synchronous; offload it to a thread
    # (e.g. asyncio.to_thread) if it blocks your event loop
    with pdfplumber.open(pdf_path) as pdf:
        tables = pdf.pages[0].extract_tables()
    if not tables:
        return None
    table_text = "\n".join(
        "\t".join(cell or "" for cell in row) for row in tables[0] if row
    )
    prompt = f"Extract the table data as JSON from the following text:\n{table_text}"
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    result = await extract_table_async("sample_tables.pdf")
    print("Async AI parsed table JSON:")
    print(result)

asyncio.run(main())
```

### Using Google Vertex AI for Table Extraction
Use Google Vertex AI if you prefer Google Cloud services and want to leverage Gemini models for table parsing.

```python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

# Assume text extracted from PDF locally
table_text = "Name\tAge\nAlice\t30\nBob\t25"
prompt = f"Extract the table data as JSON from the following text:\n{table_text}"
response = model.generate_content(prompt)
print("Vertex AI parsed table JSON:")
print(response.text)
```

## Performance
- Latency: ~1-3 seconds per page for local extraction; ~1-2 seconds per API call for AI parsing
- Cost: ~$0.002 per 500 tokens for OpenAI `gpt-4o` calls
- Rate limits: OpenAI default tier: 350 RPM / 90K TPM

Tips:

- Extract only relevant table text before sending to AI to reduce tokens.
- Use concise prompts focused on table extraction.
- Cache repeated table extraction results to avoid redundant API calls.
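The caching tip can be sketched with only the standard library: key a dict on a hash of the table text so identical tables skip the API round trip. `cached_parse` and `call_model` are illustrative names; substitute your real API call for `call_model`.

```python
import hashlib

_cache = {}

def cached_parse(table_text, call_model):
    """Return a cached result for identical table text, calling the API at most once."""
    key = hashlib.sha256(table_text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(table_text)
    return _cache[key]

# Stand-in for a real API call, so we can count invocations
calls = []
def fake_model(text):
    calls.append(text)
    return f"parsed:{text}"

cached_parse("Name\tAge\nAlice\t30", fake_model)  # hits the "API"
cached_parse("Name\tAge\nAlice\t30", fake_model)  # served from cache
print(len(calls))
# → 1
```

For a long-running service you would bound the cache (e.g. `functools.lru_cache` on a wrapper, or an LRU dict) rather than letting it grow without limit.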
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Local pdfplumber extraction | ~1-3s per page | Free | Quick extraction of simple tables |
| OpenAI GPT-4o parsing | ~1-2s per call | ~$0.002 per 500 tokens | Complex table parsing and OCR enhancement |
| Google Vertex AI Gemini | ~1-2s per call | Check Google pricing | Cloud-native AI parsing with Gemini models |
## Quick tip

Use `pdfplumber` to extract raw tables locally before sending to an AI API for cleaner, more accurate table parsing.

## Common mistake

Beginners often try to parse tables directly from raw PDF text without using a PDF parsing library, resulting in messy or incomplete data.