How to extract text from PDF with Python
Direct answer
Use the PyPDF2 library in Python to extract text from PDF files by reading each page's content; optionally, send the extracted text to an AI model like gpt-4o for further processing.
Setup
Install
pip install PyPDF2 openai
Env vars
OPENAI_API_KEY
Imports
import os
from PyPDF2 import PdfReader
from openai import OpenAI
Examples
in: A PDF file with 2 pages containing simple text paragraphs.
out: Extracted text concatenated from both pages, printed to console.
in: A scanned PDF with embedded text layers.
out: Extracted text from the text layers; if the PDF contains only scanned images, the extracted text will be empty.
in: Empty or encrypted PDF file.
out: Empty string or error message indicating extraction failure.
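The three cases above can be told apart in code. The following is a minimal sketch, not part of PyPDF2: `extract_text_safely` is a hypothetical helper, and it assumes the reader object exposes `is_encrypted`, `decrypt(password)` and `pages` the way PyPDF2's `PdfReader` does, with `decrypt` returning a falsy value when the password is rejected.

```python
def extract_text_safely(reader, password=None):
    """Return (text, status) for a PdfReader-like object.

    Hypothetical helper. status is "ok", "empty" (no text layer found),
    or "encrypted" (no password given, or the password was rejected).
    """
    if getattr(reader, "is_encrypted", False):
        # Assumption: decrypt() returns a falsy value on failure.
        if password is None or not reader.decrypt(password):
            return "", "encrypted"
    # Treat None or blank results as "no text on this page".
    parts = [(page.extract_text() or "") for page in reader.pages]
    text = "\n".join(p for p in parts if p.strip())
    return text, ("ok" if text else "empty")
```

With PyPDF2 installed, a call such as `extract_text_safely(PdfReader("sample.pdf"))` would return the text together with a status that distinguishes an image-only scan ("empty") from a locked file ("encrypted").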
Integration steps
- Install PyPDF2 and OpenAI Python packages.
- Load the PDF file using PyPDF2's PdfReader.
- Iterate through each page and extract text content.
- Optionally, initialize OpenAI client with API key from environment.
- Send extracted text to OpenAI chat completion for summarization or analysis.
- Print or save the extracted and/or processed text.
Full code
import os
from PyPDF2 import PdfReader
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Path to your PDF file
pdf_path = "sample.pdf"

# Extract text from PDF
reader = PdfReader(pdf_path)
full_text = ""
for page in reader.pages:
    text = page.extract_text()
    if text:
        full_text += text + "\n"

print("Extracted Text from PDF:")
print(full_text)

# Optional: Use OpenAI to summarize extracted text
if full_text.strip():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": f"Summarize the following text:\n{full_text}"}
        ]
    )
    summary = response.choices[0].message.content
    print("\nSummary from OpenAI GPT-4o:")
    print(summary)
else:
    print("No text extracted from PDF.")
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize the following text:\n<extracted_text>"}]}
Response
{"choices": [{"message": {"content": "<summary_text>"}}], "usage": {"total_tokens": 150}}
Extract
response.choices[0].message.content
Variants
Streaming summary with OpenAI
Use streaming to display the summary progressively for large extracted texts.
import os
from PyPDF2 import PdfReader
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pdf_path = "sample.pdf"

reader = PdfReader(pdf_path)
full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)
print("Extracted Text from PDF:")
print(full_text)

if full_text.strip():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize the following text:\n{full_text}"}],
        stream=True
    )
    print("\nStreaming summary from OpenAI GPT-4o:")
    for chunk in response:
        # delta is an object, not a dict; its content may be None
        if chunk.choices:
            print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()
else:
    print("No text extracted from PDF.")
Async extraction and summarization
Use async for integrating PDF text extraction and AI calls in asynchronous applications.
import os
import asyncio
from PyPDF2 import PdfReader
from openai import AsyncOpenAI

async def extract_and_summarize(pdf_path: str):
    # AsyncOpenAI provides awaitable versions of the client methods
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    # PdfReader is synchronous, so extraction blocks the event loop while it runs
    reader = PdfReader(pdf_path)
    full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)
    print("Extracted Text from PDF:")
    print(full_text)
    if full_text.strip():
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize the following text:\n{full_text}"}]
        )
        summary = response.choices[0].message.content
        print("\nSummary from OpenAI GPT-4o:")
        print(summary)
    else:
        print("No text extracted from PDF.")

asyncio.run(extract_and_summarize("sample.pdf"))
Extract text only with PyPDF2 (no AI)
Use when you only need raw text extraction from PDFs without AI processing.
from PyPDF2 import PdfReader

pdf_path = "sample.pdf"
reader = PdfReader(pdf_path)
full_text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)
print("Extracted Text from PDF:")
print(full_text)
Performance
Latency: ~500ms to 2s for PDF text extraction plus ~800ms for OpenAI summarization (non-streaming)
Cost: ~$0.002 per 500 tokens for gpt-4o summarization calls
Rate limits: OpenAI default tier: 350 RPM / 60K TPM
- Extract only relevant pages to reduce token count.
- Summarize in chunks if PDF text is very large.
- Use smaller models like gpt-4o-mini for cost savings.
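The chunking tip above can be sketched as follows. This is an illustrative helper rather than an established API: `chunk_text` and its `max_chars` parameter are hypothetical names, and a character limit is used only as a rough stand-in for a token limit.

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most max_chars, preferring line breaks.

    Hypothetical helper; max_chars approximates a token budget by characters.
    """
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into fixed-size pieces.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be summarized with its own chat completion call, and the partial summaries combined in a final call. Splitting on line breaks keeps sentences intact more often than slicing at fixed offsets.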
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| PyPDF2 + OpenAI gpt-4o | ~2-3s total | ~$0.002 per 500 tokens | Accurate extraction + AI summarization |
| PyPDF2 only | ~500ms | Free | Simple text extraction without AI |
| Streaming OpenAI summary | ~2-3s with progressive output | ~$0.002 per 500 tokens | Better UX for large texts |
Quick tip
Use PyPDF2 to extract text from PDFs that have a text layer, and send the text to an AI model like gpt-4o for summarization or analysis.
Common mistake
Beginners often forget to check whether extract_text() returns None for some pages, which causes errors when concatenating.
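One way to guard against this is to coerce each result with `or ""` and to record which pages produced nothing, since a run of empty pages usually points to scanned images. The helper below is a sketch with a hypothetical name, `pages_without_text`; it assumes only that each page object has an `extract_text()` method.

```python
def pages_without_text(pages):
    """Return 1-based numbers of pages whose extract_text() is None or blank.

    Hypothetical helper; works with any iterable of page-like objects.
    """
    return [
        number
        for number, page in enumerate(pages, start=1)
        if not (page.extract_text() or "").strip()
    ]
```

With PyPDF2, `pages_without_text(PdfReader("sample.pdf").pages)` would list the pages to inspect by hand or to route to a separate OCR step.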