
How to use pypdf for PDF processing

Quick answer
Use the pypdf library to read PDF files, extract their text, and manipulate their pages in Python. Install it with pip install pypdf, then open a PDF with PdfReader and access its pages or metadata programmatically.

PREREQUISITES

  • Python 3.8+
  • pip install pypdf

Setup

Install the pypdf library using pip to enable PDF processing in Python.

bash
pip install pypdf

Step by step

This example demonstrates how to open a PDF file, extract text from each page, and print it.

python
from pypdf import PdfReader

# Load your PDF file
reader = PdfReader("sample.pdf")

# Get number of pages
num_pages = len(reader.pages)
print(f"Number of pages: {num_pages}")

# Extract text from each page
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    print(f"--- Page {i + 1} ---")
    print(text)
output
Number of pages: 3
--- Page 1 ---
This is the text content of page 1.
--- Page 2 ---
This is the text content of page 2.
--- Page 3 ---
This is the text content of page 3.

Common variations

You can also merge PDFs, extract metadata, or rotate pages using pypdf. For example, use PdfWriter to create or modify PDFs.

python
from pypdf import PdfReader, PdfWriter

# Merge two PDFs
reader1 = PdfReader("file1.pdf")
reader2 = PdfReader("file2.pdf")
writer = PdfWriter()

for page in reader1.pages:
    writer.add_page(page)
for page in reader2.pages:
    writer.add_page(page)

with open("merged.pdf", "wb") as f_out:
    writer.write(f_out)

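# Rotate the first page 90 degrees clockwise and save it to a new file
# (rotating after merged.pdf is written would not change that file)
rotated_writer = PdfWriter()
rotated_page = rotated_writer.add_page(reader1.pages[0])
rotated_page.rotate(90)
with open("rotated.pdf", "wb") as f_out:
    rotated_writer.write(f_out)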

# Extract PDF metadata
metadata = reader1.metadata
print(metadata)
output
{'/Author': 'John Doe', '/Title': 'Sample PDF', '/Producer': 'pypdf'}

Troubleshooting

  • If extract_text() returns an empty string, the PDF page is probably scanned or image-based and has no text layer; use an OCR tool instead.
  • Ensure the PDF file path is correct to avoid FileNotFoundError.
  • For encrypted PDFs (reader.is_encrypted is True), call reader.decrypt(password) before accessing pages.

Key Takeaways

  • Use pypdf to read and extract text from PDFs easily in Python.
  • Manipulate PDFs by merging, rotating pages, or accessing metadata with PdfWriter and PdfReader.
  • Handle encrypted PDFs by decrypting before processing and use OCR for image-based PDFs where text extraction fails.
Verified 2026-04