How to use PyPDF2 for PDF chunking

Quick answer
Use PyPDF2 to read a PDF and extract its text page by page, then join the text of consecutive pages into chunks. Create a PdfReader, loop over reader.pages, call page.extract_text() on each page, and group the results into chunks of the size your processing needs.

Prerequisites

  • Python 3.8+
  • pip install "PyPDF2>=3.0.0"

Setup

Install PyPDF2 with pip; it handles PDF reading and text extraction. Quote the version specifier so the shell does not treat >= as an output redirect.

bash
pip install "PyPDF2>=3.0.0"

Step by step

This example reads a PDF file, extracts text page by page, and groups pages into chunks of a specified size.

python
from PyPDF2 import PdfReader

# Group the text of every `chunk_size` consecutive pages into one chunk.
def pdf_chunker(file_path, chunk_size=5):
    reader = PdfReader(file_path)
    num_pages = len(reader.pages)
    chunks = []
    for i in range(0, num_pages, chunk_size):
        chunk_text = []
        for j in range(i, min(i + chunk_size, num_pages)):
            page = reader.pages[j]
            text = page.extract_text() or ""
            chunk_text.append(text)
        chunks.append("\n".join(chunk_text))
    return chunks

# Example usage
if __name__ == "__main__":
    pdf_path = "sample.pdf"  # Replace with your PDF file path
    chunks = pdf_chunker(pdf_path, chunk_size=3)
    for idx, chunk in enumerate(chunks):
        print(f"Chunk {idx + 1} (length {len(chunk)} chars):")
        print(chunk[:500])  # Print first 500 chars of chunk
        print("---")
output
Chunk 1 (length 2345 chars):
[First 500 characters of chunk 1 text]
---
Chunk 2 (length 1987 chars):
[First 500 characters of chunk 2 text]
---
... (and so on)

Common variations

  • Adjust chunk_size to control how many pages per chunk.
  • page.extract_text() only reads a PDF's embedded text layer; scanned or image-only PDFs may need OCR or another library to produce usable text.
  • Combine with AI APIs by sending chunks as separate prompts for processing or embedding.
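
To see how chunk_size maps onto pages, it can help to compute the page range each chunk covers before extracting any text. The sketch below is illustrative only: chunk_page_ranges is a hypothetical helper, not part of PyPDF2, and it assumes the same fixed-size grouping used by pdf_chunker in the step-by-step example.

```python
import math
from typing import List, Tuple


def chunk_page_ranges(num_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) pairs, one per chunk.

    Hypothetical helper for illustration; it mirrors the
    range(0, num_pages, chunk_size) loop used by pdf_chunker.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start + 1, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


# A 10-page PDF with chunk_size=3 yields ceil(10 / 3) = 4 chunks;
# the last chunk holds only the remaining page.
ranges = chunk_page_ranges(10, 3)
print(ranges)  # [(1, 3), (4, 6), (7, 9), (10, 10)]
print(len(ranges) == math.ceil(10 / 3))  # True
```

Storing these ranges next to each chunk is one way to trace a chunk back to its source pages later.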

Troubleshooting

  • If extract_text() returns an empty string or None, the page is probably scanned or image-based; consider an OCR tool such as pytesseract.
  • Ensure the PDF file path is correct to avoid FileNotFoundError.
  • For very large PDFs, produce chunks one at a time (for example, with a generator) instead of building the full list in memory.
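
Two of the points above can be sketched in code: producing chunks lazily, and spotting pages that yielded no text. This is a minimal sketch, not a definitive implementation. The names iter_page_texts, iter_chunks, and empty_page_numbers are hypothetical helpers introduced here; only PdfReader, reader.pages, and page.extract_text() come from PyPDF2.

```python
from typing import Iterable, Iterator, List


def iter_page_texts(file_path: str) -> Iterator[str]:
    """Yield the extracted text of each page, one page at a time."""
    # Imported here so the helpers below can be tried without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(file_path)
    for page in reader.pages:
        yield page.extract_text() or ""


def iter_chunks(page_texts: Iterable[str], chunk_size: int = 5) -> Iterator[str]:
    """Lazily join every `chunk_size` consecutive page texts into one chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    buffer: List[str] = []
    for text in page_texts:
        buffer.append(text)
        if len(buffer) == chunk_size:
            yield "\n".join(buffer)
            buffer = []
    if buffer:  # leftover pages form a final, shorter chunk
        yield "\n".join(buffer)


def empty_page_numbers(page_texts: Iterable[str]) -> List[int]:
    """Return 1-based numbers of pages with no extractable text (OCR candidates)."""
    return [n for n, text in enumerate(page_texts, start=1) if not text.strip()]


# Demonstration with in-memory strings standing in for extracted page text.
pages = ["one", "two", "", "four", "five"]
print(list(iter_chunks(pages, chunk_size=2)))  # ['one\ntwo', '\nfour', 'five']
print(empty_page_numbers(pages))  # [3]
```

With a real file, the two generators compose as iter_chunks(iter_page_texts("sample.pdf"), chunk_size=3), so only the pages of the current chunk are held as text at any moment.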

Key takeaways

  • Use PyPDF2.PdfReader to read and extract text from PDF pages.
  • Chunk PDFs by grouping multiple pages' text for manageable processing.
  • Adjust chunk size based on your application's memory and processing needs.
Verified 2026-04