How to use PyPDF2 for PDF extraction

Quick answer

To extract text with PyPDF2, open the PDF in binary mode, create a PdfReader object, and call extract_text() on each page in reader.pages. This works for PDFs that contain a text layer; scanned or image-only PDFs have no text to extract and need OCR instead.

PREREQUISITES

Python 3.8+
pip install "PyPDF2>=3.0" (quote the requirement so the shell does not treat > as a redirect)

Setup

Install PyPDF2 via pip to handle PDF files in Python. Ensure you have Python 3.8 or newer.

bash

pip install "PyPDF2>=3.0"

Step by step

This example demonstrates how to open a PDF file, read all pages, and extract text using PyPDF2.

python

from PyPDF2 import PdfReader

# Path to your PDF file
pdf_path = "sample.pdf"

# Open the PDF file in binary mode
with open(pdf_path, "rb") as file:
    reader = PdfReader(file)
    text = ""
    # Iterate through all pages
    for page in reader.pages:
        text += page.extract_text() or ""

print(text)

output

Contents of the PDF printed as plain text
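
The loop above concatenates pages with no separator, so the last word of one page can run straight into the first word of the next. The following is a minimal sketch of one way to keep the text per page and join it with blank lines. The function names page_texts, join_pages, and extract_pdf_text are illustrative and not part of PyPDF2; only PdfReader, reader.pages, and extract_text() come from the library.

```python
def page_texts(reader):
    """Return one string per page; pages with no extractable text yield ''."""
    return [page.extract_text() or "" for page in reader.pages]


def join_pages(texts, separator="\n\n"):
    """Join per-page strings, trimming stray whitespace at page boundaries."""
    return separator.join(text.strip() for text in texts)


def extract_pdf_text(pdf_path):
    """Open pdf_path with PyPDF2 and return its text, pages separated by blank lines."""
    # Imported here so the two helpers above work with any reader-like object.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as file:
        return join_pages(page_texts(PdfReader(file)))
```

With the sample file from the example above, this would be called as extract_pdf_text("sample.pdf"). Keeping the per-page list around also makes it easy to report which page a piece of text came from.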

Common variations

Use extract_text() on individual pages for selective extraction.
Combine with PyPDF2.PdfWriter to extract and save specific pages.
For encrypted PDFs, use reader.decrypt(password) before extraction.

python

from PyPDF2 import PdfReader, PdfWriter

pdf_path = "sample.pdf"
output_path = "extracted_page.pdf"

with open(pdf_path, "rb") as file:
    reader = PdfReader(file)
    writer = PdfWriter()
    # Extract only the first page
    first_page = reader.pages[0]
    writer.add_page(first_page)

    with open(output_path, "wb") as output_file:
        writer.write(output_file)

print(f"First page saved to {output_path}")

output

First page saved to extracted_page.pdf
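
The other two variations, selective text extraction and encrypted PDFs, can be sketched in the same way. This is an illustration under stated assumptions rather than a definitive implementation: select_pages and open_reader are hypothetical helper names, and the sketch relies on decrypt() returning a falsy value when the password does not match, which is the behavior in PyPDF2 3.x as far as I know.

```python
def select_pages(reader, page_numbers):
    """Return {page_number: text} for the given 1-based page numbers."""
    total = len(reader.pages)
    result = {}
    for number in page_numbers:
        if not 1 <= number <= total:
            raise IndexError(f"page {number} is out of range 1..{total}")
        # reader.pages is 0-based, so shift the human-friendly page number
        result[number] = reader.pages[number - 1].extract_text() or ""
    return result


def open_reader(file, password=""):
    """Create a PdfReader from an open binary file, decrypting it if needed."""
    from PyPDF2 import PdfReader

    reader = PdfReader(file)
    # Assumption: decrypt() returns a falsy value on a wrong password (PyPDF2 3.x)
    if reader.is_encrypted and not reader.decrypt(password):
        raise ValueError("could not decrypt PDF: wrong or missing password")
    return reader
```

A call might look like select_pages(open_reader(file, password="secret"), [1, 3]) inside a with open("protected.pdf", "rb") as file: block, where the file name and password are placeholders.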

Troubleshooting

If extract_text() returns an empty string or None, the PDF is probably scanned or image-based and has no text layer; run OCR on the page images with a tool such as pytesseract.
For encrypted PDFs, ensure you provide the correct password with reader.decrypt().
Check that the PDF file is not corrupted or locked by another process.
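
To tell which pages are likely scanned, one option is to check how much text each page actually yields. The sketch below is a rough heuristic, not a PyPDF2 feature: pages_needing_ocr, scan_pdf, and the min_chars threshold are names and values chosen for illustration.

```python
def pages_needing_ocr(texts, min_chars=1):
    """Return 1-based numbers of pages whose extracted text is empty or too short."""
    return [
        number
        for number, text in enumerate(texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def scan_pdf(pdf_path, min_chars=1):
    """Extract text from every page of pdf_path and list pages that likely need OCR."""
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as file:
        texts = [page.extract_text() for page in PdfReader(file).pages]
    return pages_needing_ocr(texts, min_chars)
```

If scan_pdf("sample.pdf") returns every page number, the whole file is probably image-based; if it returns only a few, those pages may be scanned inserts in an otherwise text-based document.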

Key Takeaways

Use PyPDF2.PdfReader to open and read PDF files in Python.
Extract text page by page with extract_text(), guarding against empty results with or "".
Handle encrypted PDFs by decrypting before extraction.
For scanned PDFs, use OCR instead of PyPDF2.
You can extract and save specific pages using PdfWriter.

Verified 2026-04
