How to use pdfplumber in Python
Direct answer
Use the pdfplumber library in Python to open PDF files and extract text or tables by iterating over pages and calling methods like page.extract_text() or page.extract_table().
Setup
Install
pip install pdfplumber
Imports
import pdfplumber
import os
Examples
In: Extract text from a single-page PDF named 'sample.pdf'.
Out: The text extracted from the first page of 'sample.pdf' is printed.
In: Extract all text from a multi-page PDF 'report.pdf'.
Out: The concatenated text from all pages of 'report.pdf' is printed.
In: Extract tables from 'data.pdf' and print them as lists.
Out: The tables extracted from each page are printed as lists of rows.
Integration steps
- Install pdfplumber via pip.
- Import pdfplumber in your Python script.
- Open the PDF file using pdfplumber.open().
- Iterate over pages to extract text or tables.
- Process or print the extracted content.
- Close the PDF file after extraction.
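The steps above can be sketched as a small helper. This is a minimal sketch rather than the only way to structure it: the names collect_text and extract_pdf_text are hypothetical, and collect_text accepts any iterable of objects exposing extract_text(), so it can be tried without a real PDF. The file is closed explicitly here to mirror the last step; a with block achieves the same thing.

```python
def collect_text(pages):
    """Join the text of each page, skipping pages that yield no text."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # skips both None and empty strings
            parts.append(text)
    return "\n".join(parts)


def extract_pdf_text(pdf_path):
    """Open a PDF, extract all of its text, and close the file explicitly."""
    import pdfplumber  # imported here so collect_text works without it

    pdf = pdfplumber.open(pdf_path)
    try:
        return collect_text(pdf.pages)
    finally:
        pdf.close()  # explicit counterpart of the "close the PDF file" step
```

With a real file, extract_pdf_text("sample.pdf") would return the combined text of every page that has any.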
Full code
import pdfplumber

pdf_path = "sample.pdf"

# Extract text from all pages
with pdfplumber.open(pdf_path) as pdf:
    full_text = ""
    for page in pdf.pages:
        text = page.extract_text()
        if text:  # skip pages with no extractable text
            full_text += text + "\n"

print("Extracted Text:")
print(full_text)

# Example: extract tables from the first page
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()

print("Extracted Tables:")
for table in tables:
    for row in table:
        print(row)
Output
Extracted Text:
This is the text content extracted from sample.pdf.

Extracted Tables:
['Header1', 'Header2', 'Header3']
['Row1Col1', 'Row1Col2', 'Row1Col3']
['Row2Col1', 'Row2Col2', 'Row2Col3']
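Each extracted table is a list of rows, and each row is a list of cell values. A minimal sketch of turning such a table into dictionaries keyed by the header row follows. The helper table_to_records is hypothetical (it is not part of pdfplumber) and assumes the first row holds the column headers.

```python
def table_to_records(table):
    """Convert a table (list of rows) into a list of dicts keyed by header."""
    if not table:
        return []
    header, *rows = table
    # Empty header cells may come back as None; fall back to a positional name.
    keys = [cell if cell else f"column_{i}" for i, cell in enumerate(header)]
    return [dict(zip(keys, row)) for row in rows]


sample_table = [
    ["Header1", "Header2", "Header3"],
    ["Row1Col1", "Row1Col2", "Row1Col3"],
    ["Row2Col1", "Row2Col2", "Row2Col3"],
]
records = table_to_records(sample_table)
print(records[0]["Header2"])  # prints Row1Col2
```

Keyed records are usually easier to filter or export than positional rows, at the cost of assuming a header row exists.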
API trace
Request
No API request is made; pdfplumber is a local Python library that reads PDF files directly.
Response
Returns Python objects: each page object provides extract_text(), which returns a string, and extract_tables(), which returns a list of tables, where each table is a list of rows and each row is a list of cell values.
Extract
Call page.extract_text() for text or page.extract_tables() for tables on each pdfplumber page object.
Variants
Extract text from a specific page only
When you only need text from a specific page instead of the entire document.
import pdfplumber

pdf_path = "sample.pdf"
page_number = 2  # zero-based index, so this is the third page

with pdfplumber.open(pdf_path) as pdf:
    # Guard against negative and too-large indices
    if 0 <= page_number < len(pdf.pages):
        page = pdf.pages[page_number]
        text = page.extract_text()
        print(f"Text from page {page_number + 1}:")
        print(text)
    else:
        print("Page number out of range.")
Extract tables from all pages
When you want to extract and process tables from every page in a PDF.
import pdfplumber

pdf_path = "data.pdf"

with pdfplumber.open(pdf_path) as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        print(f"Tables on page {i + 1}:")
        for table in tables:
            for row in table:
                print(row)
            print("---")  # separator after each table
Extract text with layout information
When you need detailed character-level layout or position data from the PDF.
import pdfplumber

pdf_path = "sample.pdf"

with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[0]
    chars = page.chars  # list of character dicts with position info
    for char in chars:
        print(f"Char: {char['text']} at ({char['x0']}, {char['top']})")
Performance
Latency: roughly 100-500 ms per page, depending on PDF complexity and system speed
Cost: Free, open-source library with no API usage costs
Rate limits: None; runs locally without network calls
- Extract only needed pages to reduce processing time.
- Avoid extracting images or complex objects if only text is needed.
- Cache extracted text if processing the same PDF multiple times.
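The caching tip can be sketched as follows. This is one possible approach under stated assumptions: cached_extract and the in-memory _cache dict are hypothetical names, the cache key combines the file path with its modification time so that an edited file is re-read, and extract stands for any function that takes a path and returns text (for example, one built on pdfplumber.open()).

```python
import os

_cache = {}  # maps (absolute path, modification time) -> extracted text


def cached_extract(pdf_path, extract):
    """Return cached text for pdf_path, calling extract(pdf_path) on a miss."""
    key = (os.path.abspath(pdf_path), os.path.getmtime(pdf_path))
    if key not in _cache:
        _cache[key] = extract(pdf_path)
    return _cache[key]
```

The cache lives only for the lifetime of the process and never evicts old entries, so a long-running program that handles many files would need a size limit or an on-disk store instead.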
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Full document text extraction | ~300ms per page | Free | Complete text extraction |
| Single page extraction | ~100ms | Free | Quick access to specific page text |
| Table extraction | ~400ms per page | Free | Extracting structured tables from PDFs |
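The latency figures in the table are rough estimates that vary with the PDF and the machine, so it can be worth measuring on your own files. A minimal sketch, assuming page objects with an extract_text() method; time_pages is a hypothetical helper, not a pdfplumber function.

```python
import time


def time_pages(pages):
    """Return a list of (page_number, seconds) for extract_text() per page."""
    timings = []
    for number, page in enumerate(pages, start=1):
        start = time.perf_counter()
        page.extract_text()
        timings.append((number, time.perf_counter() - start))
    return timings
```

With a real file, call time_pages(pdf.pages) inside the with pdfplumber.open(...) block, since pages can only be read while the file is open.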
Quick tip
Use pdfplumber.open() as a context manager in a with statement to ensure files are properly closed after extraction.
Common mistake
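Forgetting to check whether page.extract_text() returned no text. Depending on the pdfplumber version, a page with no extractable text may yield None or an empty string, so test the result for truthiness before using it.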
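A minimal sketch of guarding against this: the hypothetical helper below lists the pages that yield no text (for instance, pages that contain only scanned images), treating None, empty strings, and whitespace-only results alike.

```python
def pages_without_text(pages):
    """Return 1-based numbers of pages whose extract_text() yields nothing."""
    empty = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text()
        if not text or not text.strip():
            empty.append(number)
    return empty
```

Running this over pdf.pages before further processing makes it clear which pages were skipped, rather than silently dropping them.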