
How to use PyPDF2 for PDF chunking

Quick answer

Use PyPDF2 to read a PDF file and extract text page by page or in groups of pages to create chunks. Iterate over pages with PdfReader, extract text using page.extract_text(), and combine pages into chunks as needed for processing.

PREREQUISITES

Python 3.8+
pip install "PyPDF2>=3.0.0"

Setup

Install PyPDF2 via pip to handle PDF reading and text extraction.

bash

# Quote the requirement so the shell does not treat ">" as output redirection
pip install "PyPDF2>=3.0.0"

Step by step

This example reads a PDF file, extracts text page by page, and groups pages into chunks of a specified size.

python

from PyPDF2 import PdfReader

def pdf_chunker(file_path, chunk_size=5):
    """Return a list of strings, each holding the text of up to chunk_size consecutive pages."""
    reader = PdfReader(file_path)
    num_pages = len(reader.pages)
    chunks = []
    for i in range(0, num_pages, chunk_size):
        chunk_text = []
        for j in range(i, min(i + chunk_size, num_pages)):
            page = reader.pages[j]
            text = page.extract_text() or ""
            chunk_text.append(text)
        chunks.append("\n".join(chunk_text))
    return chunks

# Example usage
if __name__ == "__main__":
    pdf_path = "sample.pdf"  # Replace with your PDF file path
    chunks = pdf_chunker(pdf_path, chunk_size=3)
    for idx, chunk in enumerate(chunks):
        print(f"Chunk {idx + 1} (length {len(chunk)} chars):")
        print(chunk[:500])  # Print first 500 chars of chunk
        print("---")

output

Chunk 1 (length 2345 chars):
[First 500 characters of chunk 1 text]
---
Chunk 2 (length 1987 chars):
[First 500 characters of chunk 2 text]
---
... (and so on)

Common variations

Adjust chunk_size to control how many pages per chunk.
page.extract_text() only reads a PDF's embedded text layer; scanned or image-only PDFs need OCR, and PDFs with complex layouts may extract more cleanly with another library.
Combine with AI APIs by sending chunks as separate prompts for processing or embedding.
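
When chunks are sent on for processing or embedding, it can help to record which pages each chunk came from and which pages yielded no text. The following is a minimal sketch, not part of PyPDF2: chunk_with_page_ranges and its dictionary keys are hypothetical names chosen for illustration, and the function assumes you have already collected one string per page (for example with page.extract_text() or "").

```python
from typing import Dict, List


def chunk_with_page_ranges(page_texts: List[str], chunk_size: int = 5) -> List[Dict]:
    """Group per-page text into chunks and keep 1-based page ranges.

    Hypothetical helper for illustration; page_texts is assumed to hold
    one already-extracted string per page, in page order.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    records = []
    for start in range(0, len(page_texts), chunk_size):
        pages = page_texts[start:start + chunk_size]
        records.append({
            "start_page": start + 1,            # 1-based, inclusive
            "end_page": start + len(pages),     # 1-based, inclusive
            # Pages with no extractable text are likely OCR candidates
            "empty_pages": [
                start + k + 1
                for k, text in enumerate(pages)
                if not (text or "").strip()
            ],
            "text": "\n".join(text or "" for text in pages),
        })
    return records
```

Each record can then be passed along with its page range, so a result can be traced back to the pages that produced it.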

Troubleshooting

If extract_text() returns an empty string (or None), the page is probably scanned or image-based; consider OCR tools like pytesseract, which work on page images rather than on the PDF itself.
Ensure the PDF file path is correct to avoid FileNotFoundError.
For very large PDFs, yield chunks one at a time instead of building the full list, so that only one chunk of text is held in memory at once.
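
The lazy approach for large PDFs can be sketched with generators. This is one possible shape under stated assumptions: iter_page_texts and iter_chunks are hypothetical helper names, not PyPDF2 functions, and the approach limits only how much extracted text is held at once, not the memory PdfReader itself uses for the file.

```python
from typing import Iterable, Iterator


def iter_page_texts(file_path: str) -> Iterator[str]:
    """Yield the text of each page in order (hypothetical helper)."""
    # Imported here so iter_chunks below can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(file_path)
    for page in reader.pages:
        yield page.extract_text() or ""


def iter_chunks(page_texts: Iterable[str], chunk_size: int = 5) -> Iterator[str]:
    """Yield the joined text of every chunk_size pages, one chunk at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    buffer = []
    for text in page_texts:
        buffer.append(text or "")
        if len(buffer) == chunk_size:
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Final chunk may hold fewer than chunk_size pages
        yield "\n".join(buffer)
```

A typical call would be iter_chunks(iter_page_texts("sample.pdf"), chunk_size=3), consumed in a for loop so each chunk can be processed and discarded before the next one is built.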

Key Takeaways

Use PyPDF2.PdfReader to read and extract text from PDF pages.
Chunk PDFs by grouping multiple pages' text for manageable processing.
Adjust chunk size based on your application's memory and processing needs.

Verified 2026-04
