How to use PyPDF2 for PDF extraction
Quick answer
Use PyPDF2 to extract text from PDFs by opening the file in binary mode, creating a PdfReader object, and iterating through pages to call extract_text(). This method works for most text-based PDFs and is simple to implement in Python.
Prerequisites
- Python 3.8+
- pip install "PyPDF2>=3.0"
Setup
Install PyPDF2 via pip to handle PDF files in Python. Ensure you have Python 3.8 or newer.
pip install PyPDF2
Step by step
This example demonstrates how to open a PDF file, read all pages, and extract text using PyPDF2.
from PyPDF2 import PdfReader

# Path to your PDF file
pdf_path = "sample.pdf"

# Open the PDF file in binary mode
with open(pdf_path, "rb") as file:
    reader = PdfReader(file)
    text = ""
    # Iterate through all pages
    for page in reader.pages:
        text += page.extract_text() or ""

print(text)
Output
Contents of the PDF printed as plain text
Common variations
- Use extract_text() on individual pages for selective extraction.
- Combine with PyPDF2.PdfWriter to extract and save specific pages.
- For encrypted PDFs, use reader.decrypt(password) before extraction.
from PyPDF2 import PdfReader, PdfWriter

pdf_path = "sample.pdf"
output_path = "extracted_page.pdf"

with open(pdf_path, "rb") as file:
    reader = PdfReader(file)
    writer = PdfWriter()
    # Extract only the first page
    first_page = reader.pages[0]
    writer.add_page(first_page)
    # Write the output while the source file is still open
    with open(output_path, "wb") as output_file:
        writer.write(output_file)

print(f"First page saved to {output_path}")
Output
First page saved to extracted_page.pdf
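The selective-extraction and encrypted-PDF variations can be sketched in the same way. This is a minimal sketch, not a definitive implementation: the helper names extract_selected_pages and open_reader are hypothetical, the 1-based page numbering is a choice made here, and PyPDF2 is imported inside open_reader only so that the first helper can be tried with any objects that expose extract_text().

```python
def extract_selected_pages(pages, page_numbers):
    """Return {page_number: text} for the requested 1-based page numbers.

    `pages` is any indexable sequence of objects with an extract_text()
    method, for example reader.pages from a PyPDF2 PdfReader.
    """
    selected = {}
    for number in page_numbers:
        if not 1 <= number <= len(pages):
            raise IndexError(f"page {number} is out of range (1-{len(pages)})")
        # Fall back to "" in case a page yields no text at all
        selected[number] = pages[number - 1].extract_text() or ""
    return selected


def open_reader(path, password=None):
    """Open a PDF and decrypt it first if it is encrypted (hypothetical helper)."""
    # Imported here so extract_selected_pages works without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    if reader.is_encrypted:
        if password is None:
            raise ValueError("PDF is encrypted; a password is required")
        # decrypt() returns a falsy value when the password does not match
        if not reader.decrypt(password):
            raise ValueError("wrong password for encrypted PDF")
    return reader
```

A call such as extract_selected_pages(open_reader("sample.pdf").pages, [1, 3]) would then return the text of the first and third pages only, assuming sample.pdf has at least three pages.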
Troubleshooting
- If extract_text() returns an empty string or None, the PDF might be scanned or image-based; consider OCR tools like pytesseract.
- For encrypted PDFs, ensure you provide the correct password with reader.decrypt().
- Check that the PDF file is not corrupted or locked by another process.
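To tell which pages are likely scanned before reaching for OCR, one option is to list the pages that yield no text. The helper below is a rough sketch under stated assumptions: find_pages_without_text and its min_chars threshold are hypothetical names introduced for illustration, and an empty result from extract_text() is only a hint that a page is image-based, not proof.

```python
def find_pages_without_text(pages, min_chars=1):
    """Return 1-based numbers of pages with fewer than min_chars of text.

    `pages` is any iterable of objects with an extract_text() method,
    for example reader.pages from a PyPDF2 PdfReader.
    """
    suspect = []
    for number, page in enumerate(pages, start=1):
        # Treat None like an empty string and ignore surrounding whitespace
        text = (page.extract_text() or "").strip()
        if len(text) < min_chars:
            suspect.append(number)
    return suspect
```

With a real file, find_pages_without_text(PdfReader("sample.pdf").pages) would return the page numbers worth sending to an OCR tool; an empty list suggests the PDF is text-based throughout.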
Key Takeaways
- Use PyPDF2.PdfReader to open and read PDF files in Python.
- Extract text page-by-page with extract_text() for reliable results.
- Handle encrypted PDFs by decrypting before extraction.
- For scanned PDFs, use OCR instead of PyPDF2.
- You can extract and save specific pages using PdfWriter.