
How to extract text from PDF with Python

Direct answer

Use the PyPDF2 library in Python to extract text from PDF files by reading each page's content; optionally, send extracted text to an AI model like gpt-4o for further processing.

Setup

Install

bash

pip install PyPDF2 openai

Env vars

OPENAI_API_KEY
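
A minimal sketch of checking for the key up front, so a missing variable produces a readable message instead of a KeyError partway through the script. The helper name `require_api_key` is hypothetical and not part of the OpenAI SDK.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or raise with a readable message."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key
```

The returned value can be passed as `api_key` when constructing the client.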

Imports

python

import os
from PyPDF2 import PdfReader
from openai import OpenAI

Examples

in: A PDF file with 2 pages containing simple text paragraphs.

out: Extracted text concatenated from both pages, printed to console.

in: A scanned PDF with an embedded text layer.

out: Text from the embedded layer. If the pages are images only, extraction returns empty text.

in: An empty or encrypted PDF file.

out: An empty string, or an error indicating that extraction failed.
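
To tell the scanned-image case apart from a normal PDF, one option is to list the pages that yield no text. The sketch below assumes only that each page object has an `extract_text()` method, as PyPDF2 pages do. The helper name `pages_without_text` is hypothetical.

```python
from typing import Iterable, List


def pages_without_text(pages: Iterable) -> List[int]:
    """Return 1-based numbers of pages whose extract_text() yields nothing."""
    empty = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""  # guard against a None result
        if not text.strip():
            empty.append(number)
    return empty


# Assumed usage with PyPDF2 (not executed here):
#   reader = PdfReader("sample.pdf")
#   missing = pages_without_text(reader.pages)
#   if missing: print(f"No text layer on pages {missing}; OCR may be needed")
```

If every page is reported, the file is probably image-only and needs OCR, which PyPDF2 does not provide.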

Integration steps

Install PyPDF2 and OpenAI Python packages.
Load the PDF file using PyPDF2's PdfReader.
Iterate through each page and extract text content.
Optionally, initialize OpenAI client with API key from environment.
Send extracted text to OpenAI chat completion for summarization or analysis.
Print or save the extracted and/or processed text.

Full code

python

import os
from PyPDF2 import PdfReader
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Path to your PDF file
pdf_path = "sample.pdf"

# Extract text from PDF
reader = PdfReader(pdf_path)
full_text = ""
for page in reader.pages:
    text = page.extract_text()
    if text:
        full_text += text + "\n"

print("Extracted Text from PDF:")
print(full_text)

# Optional: Use OpenAI to summarize extracted text
if full_text.strip():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": f"Summarize the following text:\n{full_text}"}
        ]
    )
    summary = response.choices[0].message.content
    print("\nSummary from OpenAI GPT-4o:")
    print(summary)
else:
    print("No text extracted from PDF.")

API trace

Request

json

{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize the following text:\n<extracted_text>"}]}

Response

json

{"choices": [{"message": {"content": "<summary_text>"}}], "usage": {"total_tokens": 150}}

Extract: response.choices[0].message.content
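
When working with the raw JSON body rather than the SDK's response object, the same path applies as dictionary keys. A small sketch using the placeholder response from the trace above:

```python
import json

# Response body copied from the trace, with its placeholder content.
raw = '{"choices": [{"message": {"content": "<summary_text>"}}], "usage": {"total_tokens": 150}}'

data = json.loads(raw)
summary = data["choices"][0]["message"]["content"]  # same path as the SDK attribute access
total_tokens = data["usage"]["total_tokens"]        # useful for tracking cost per call

print(summary)
print(total_tokens)
```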

Variants

Streaming summary with OpenAI

Use streaming to display the summary progressively for large extracted texts.

python

import os
from PyPDF2 import PdfReader
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pdf_path = "sample.pdf"
reader = PdfReader(pdf_path)
full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)

print("Extracted Text from PDF:")
print(full_text)

if full_text.strip():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize the following text:\n{full_text}"}],
        stream=True
    )
    print("\nStreaming summary from OpenAI GPT-4o:")
    for chunk in response:
        # delta is an object, not a dict; content is None on some chunks
        if chunk.choices:
            print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()
else:
    print("No text extracted from PDF.")

Async extraction and summarization

Use async for integrating PDF text extraction and AI calls in asynchronous applications.

python

import os
import asyncio
from PyPDF2 import PdfReader
from openai import AsyncOpenAI

async def extract_and_summarize(pdf_path: str):
    # AsyncOpenAI is the awaitable client in openai>=1.0
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    # PdfReader is synchronous, so extraction still blocks the event loop briefly
    reader = PdfReader(pdf_path)
    full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)

    print("Extracted Text from PDF:")
    print(full_text)

    if full_text.strip():
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize the following text:\n{full_text}"}]
        )
        summary = response.choices[0].message.content
        print("\nSummary from OpenAI GPT-4o:")
        print(summary)
    else:
        print("No text extracted from PDF.")

asyncio.run(extract_and_summarize("sample.pdf"))

Extract text only with PyPDF2 (no AI)

Use when you only need raw text extraction from PDFs without AI processing.

python

from PyPDF2 import PdfReader

pdf_path = "sample.pdf"
reader = PdfReader(pdf_path)
full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)
print("Extracted Text from PDF:")
print(full_text)

Performance

Latency: ~500 ms to 2 s for PDF text extraction, plus ~800 ms for OpenAI summarization (non-streaming)

Cost: ~$0.002 per 500 tokens for gpt-4o summarization calls

Rate limits: 350 RPM / 60K TPM on the OpenAI default tier
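
As a rough sketch, the cost figure above can be turned into a pre-flight estimate. Both defaults are assumptions: about 4 characters per token is a common heuristic for English text rather than a real tokenizer, and the rate is this article's ballpark figure, not official pricing. The function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length (heuristic, not a tokenizer)."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def estimate_cost_usd(tokens: int, usd_per_500_tokens: float = 0.002) -> float:
    """Scale the ballpark rate linearly; real pricing differs for input and output tokens."""
    return tokens / 500 * usd_per_500_tokens
```

For an exact count, a tokenizer library would replace `estimate_tokens`.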

Extract only relevant pages to reduce token count.
Summarize in chunks if PDF text is very large.
Use a smaller model such as gpt-4o-mini for cost savings.
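
The chunking tip can be sketched as a helper that splits the extracted text on line boundaries. The name `chunk_text` and the 8000-character default are assumptions for illustration; a suitable limit depends on the model's context window.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 8000) -> List[str]:
    """Split text into chunks of at most max_chars, breaking on newlines where possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into fixed-size pieces.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be summarized in its own request, and the partial summaries combined in a final request.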

Approach	Latency	Cost/call	Best for
PyPDF2 + OpenAI gpt-4o	~2-3s total	~$0.002 per 500 tokens	Accurate extraction + AI summarization
PyPDF2 only	~500ms	Free	Simple text extraction without AI
Streaming OpenAI summary	~2-3s with progressive output	~$0.002 per 500 tokens	Better UX for large texts

✓

Quick tip

Use PyPDF2 for reliable text extraction from PDFs and send the text to an AI model like gpt-4o for summarization or analysis.

⚠

Common mistake

Beginners often forget to check whether extract_text() returns None for some pages, which causes errors when concatenating.

Verified 2026-04 · gpt-4o
