How to extract text from PDF with Python

Direct answer
Use the PyPDF2 library in Python to extract text from PDF files by reading each page's content; optionally, send extracted text to an AI model like gpt-4o for further processing.

Setup

Install
bash
pip install PyPDF2 openai
Env vars
OPENAI_API_KEY
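The full code reads the key with os.environ["OPENAI_API_KEY"], which raises a bare KeyError when the variable is unset. A minimal sketch of a friendlier check, assuming you want a readable error instead; require_env is a hypothetical helper name, not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical error wording; adjust to your project's conventions.
        raise RuntimeError(f"Set the {name} environment variable before running this script.")
    return value
```

Typical use would be api_key = require_env("OPENAI_API_KEY") before constructing the client.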
Imports
python
import os
from PyPDF2 import PdfReader
from openai import OpenAI

Examples

in: A PDF file with 2 pages containing simple text paragraphs.
out: Extracted text concatenated from both pages, printed to console.
in: A scanned PDF with embedded text layers.
out: Text extracted from the text layers; for image-only scans, extraction returns an empty string.
in: Empty or encrypted PDF file.
out: Empty string or error message indicating extraction failure.
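The scanned-PDF case is easy to miss because extraction succeeds but yields nothing. A rough sketch of how one might flag such pages; pages_without_text and FakePage are hypothetical names, and the function only assumes each page object has an extract_text() method, as PyPDF2 pages do.

```python
from typing import Iterable, List


def pages_without_text(pages: Iterable) -> List[int]:
    """Return 0-based indices of pages whose extract_text() yields nothing."""
    empty = []
    for index, page in enumerate(pages):
        text = page.extract_text() or ""  # treat None like an empty string
        if not text.strip():
            empty.append(index)
    return empty


class FakePage:
    """Stand-in for a PyPDF2 page so the sketch runs without a PDF file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


if __name__ == "__main__":
    demo = [FakePage("Hello"), FakePage(""), FakePage(None), FakePage("  \n")]
    print(pages_without_text(demo))  # prints [1, 2, 3]
```

With a real file you would pass PdfReader("sample.pdf").pages; a non-empty result suggests image-only pages that likely need OCR rather than text extraction.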

Integration steps

  1. Install the PyPDF2 and openai Python packages.
  2. Load the PDF file using PyPDF2's PdfReader.
  3. Iterate through each page and extract its text content.
  4. Optionally, initialize the OpenAI client with the API key from the environment.
  5. Send the extracted text to an OpenAI chat completion for summarization or analysis.
  6. Print or save the extracted and/or processed text.

Full code

python
import os
from PyPDF2 import PdfReader
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Path to your PDF file
pdf_path = "sample.pdf"

# Extract text from PDF
reader = PdfReader(pdf_path)
full_text = ""
for page in reader.pages:
    text = page.extract_text()
    if text:
        full_text += text + "\n"

print("Extracted Text from PDF:")
print(full_text)

# Optional: Use OpenAI to summarize extracted text
if full_text.strip():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": f"Summarize the following text:\n{full_text}"}
        ]
    )
    summary = response.choices[0].message.content
    print("\nSummary from OpenAI GPT-4o:")
    print(summary)
else:
    print("No text extracted from PDF.")

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize the following text:\n<extracted_text>"}]}
Response
json
{"choices": [{"message": {"content": "<summary_text>"}}], "usage": {"total_tokens": 150}}
Extract: response.choices[0].message.content
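To make the extraction path concrete, here is a small standard-library sketch that walks the sample response from the trace as raw JSON. The payload is the placeholder from the trace, not real API output; with the SDK you would use attribute access on the typed response object instead.

```python
import json

# Sample payload copied from the response trace; a real response has more fields.
raw = '{"choices": [{"message": {"content": "<summary_text>"}}], "usage": {"total_tokens": 150}}'

data = json.loads(raw)
summary = data["choices"][0]["message"]["content"]  # same path as the Extract line
total_tokens = data["usage"]["total_tokens"]        # useful for cost tracking

print(summary)       # prints <summary_text>
print(total_tokens)  # prints 150
```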

Variants

Streaming summary with OpenAI

Use streaming to display the summary progressively for large extracted texts.

python
import os
from PyPDF2 import PdfReader
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pdf_path = "sample.pdf"
reader = PdfReader(pdf_path)
full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)

print("Extracted Text from PDF:")
print(full_text)

if full_text.strip():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize the following text:\n{full_text}"}],
        stream=True
    )
    print("\nStreaming summary from OpenAI GPT-4o:")
    for chunk in response:
        # delta is an object, not a dict; content may be None on some chunks
        if chunk.choices:
            print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()
else:
    print("No text extracted from PDF.")
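If you also want the complete summary as a string after streaming, one option is to accumulate the deltas. The sketch below is an illustration under assumptions: collect_stream and fake_chunk are hypothetical names, and the fake chunks only mimic the choices[0].delta.content shape of streamed chunks so the example runs without an API call.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # a chunk may arrive with an empty choices list
            continue
        content = chunk.choices[0].delta.content
        if content:  # content can be None, for example on the final chunk
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    """Build an object shaped like a stream chunk (hypothetical test double)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


if __name__ == "__main__":
    stream = [fake_chunk("Hello"), SimpleNamespace(choices=[]), fake_chunk(", world"), fake_chunk(None)]
    print(collect_stream(stream))  # prints Hello, world
```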
Async extraction and summarization

Use async for integrating PDF text extraction and AI calls in asynchronous applications.

python
import os
import asyncio
from PyPDF2 import PdfReader
from openai import AsyncOpenAI

async def extract_and_summarize(pdf_path: str):
    # AsyncOpenAI exposes awaitable versions of the same client methods
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    reader = PdfReader(pdf_path)
    full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)

    print("Extracted Text from PDF:")
    print(full_text)

    if full_text.strip():
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize the following text:\n{full_text}"}]
        )
        summary = response.choices[0].message.content
        print("\nSummary from OpenAI GPT-4o:")
        print(summary)
    else:
        print("No text extracted from PDF.")

asyncio.run(extract_and_summarize("sample.pdf"))
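PdfReader and extract_text() are synchronous, so calling them inside a coroutine blocks the event loop while the PDF is parsed. One possible approach, assuming Python 3.9 or later, is to push the blocking work to a worker thread with asyncio.to_thread. In this sketch slow_extract is a hypothetical stand-in for the real extraction so it runs without a PDF file.

```python
import asyncio
import time


def slow_extract(path: str) -> str:
    """Hypothetical stand-in for blocking PDF extraction."""
    time.sleep(0.1)  # simulates blocking parse work
    return f"text from {path}"


async def extract_many(paths):
    """Run each blocking extraction in a worker thread and gather results in input order."""
    tasks = [asyncio.to_thread(slow_extract, p) for p in paths]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(extract_many(["a.pdf", "b.pdf"])))
    # prints ['text from a.pdf', 'text from b.pdf']
```

In real code, slow_extract would wrap the PdfReader loop, leaving the event loop free for other requests while files are parsed.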
Extract text only with PyPDF2 (no AI)

Use when you only need raw text extraction from PDFs without AI processing.

python
from PyPDF2 import PdfReader

pdf_path = "sample.pdf"
reader = PdfReader(pdf_path)
full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)
print("Extracted Text from PDF:")
print(full_text)

Performance

Latency: ~500ms to 2s for PDF text extraction, plus ~800ms for OpenAI summarization (non-streaming)
Cost: ~$0.002 per 500 tokens for gpt-4o summarization calls
Rate limits: OpenAI default tier, 350 RPM / 60K TPM
  • Extract only relevant pages to reduce token count.
  • Summarize in chunks if PDF text is very large.
  • Use smaller models like gpt-4o-mini for cost savings.
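The chunking tip can be sketched as a small splitter that breaks on line boundaries. This is an illustration, not a tuned implementation: chunk_text is a hypothetical helper, character counts are only a proxy for tokens, and the 8000-character default is an arbitrary assumption.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 8000) -> List[str]:
    """Split text into chunks of at most max_chars, breaking on line boundaries where possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is hard-split.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(chunk_text("aaaa\nbbbb\ncccc\n", max_chars=10))
    # prints ['aaaa\nbbbb\n', 'cccc\n']
```

Each chunk could then be summarized separately and the partial summaries combined in a final call.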
Approach | Latency | Cost/call | Best for
PyPDF2 + OpenAI gpt-4o | ~2-3s total | ~$0.002 per 500 tokens | Accurate extraction + AI summarization
PyPDF2 only | ~500ms | Free | Simple text extraction without AI
Streaming OpenAI summary | ~2-3s with progressive output | ~$0.002 per 500 tokens | Better UX for large texts
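As a worked example of the cost figure, the arithmetic can be wrapped in two small helpers. Both function names are hypothetical, the default rate is this article's approximate ~$0.002 per 500 tokens rather than a quoted price, and the 4-characters-per-token ratio is a rough assumption for English text.

```python
def estimate_cost_usd(total_tokens: int, usd_per_500_tokens: float = 0.002) -> float:
    """Estimate cost from a token count using a flat per-500-token rate."""
    return total_tokens / 500 * usd_per_500_tokens


def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token)


if __name__ == "__main__":
    # 150 total tokens, as in the sample API trace
    print(f"{estimate_cost_usd(150):.4f}")  # prints 0.0006
```

For an exact count, read usage.total_tokens from the API response instead of estimating from characters.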

Quick tip

Use PyPDF2 for reliable text extraction from PDFs and send the text to an AI model like gpt-4o for summarization or analysis.

Common mistake

Beginners often forget to check whether extract_text() returns None for some pages, which raises a TypeError during string concatenation.
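A minimal sketch of the guard, using the same `or ""` idiom as the variants in this article. join_page_text and StubPage are hypothetical names; the stub only imitates a page whose extract_text() may return None.

```python
from typing import Iterable


def join_page_text(pages: Iterable) -> str:
    """Concatenate page text, substituting an empty string when a page yields None."""
    return "".join((page.extract_text() or "") + "\n" for page in pages)


class StubPage:
    """Stand-in for a PDF page so the sketch runs without a file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


if __name__ == "__main__":
    pages = [StubPage("First"), StubPage(None), StubPage("Third")]
    print(repr(join_page_text(pages)))  # prints 'First\n\nThird\n'
```

Without the `or ""` fallback, the None page would make the `+ "\n"` concatenation fail with a TypeError.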

Verified 2026-04 · gpt-4o