Debug Fix intermediate · 3 min read

How to handle scanned PDFs with OCR

Q: How to handle scanned PDFs with OCR

To handle scanned PDFs, use an OCR tool or API like OpenAI Whisper or Google Vision OCR to extract text from images embedded in PDFs. Convert PDF pages to images first, then apply OCR to get machine-readable text for further AI processing.

Quick answer

To handle scanned PDFs, use an OCR tool or API like OpenAI Whisper or Google Vision OCR to extract text from images embedded in PDFs. Convert PDF pages to images first, then apply OCR to get machine-readable text for further AI processing.

ERROR TYPE code_error

QUICK FIX

Convert scanned PDF pages to images before sending them to an OCR API like whisper-1 for transcription.

Why this happens

Scanned PDFs contain images of text rather than selectable text, so direct text extraction methods like pdfminer or PyPDF2 return no or garbled text. Attempting to process scanned PDFs without OCR leads to empty or incorrect outputs.

Typical broken code tries to extract text directly:

python

from PyPDF2 import PdfReader

reader = PdfReader("scanned_document.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() or ""
print(text)  # Outputs empty or gibberish for scanned PDFs

output

The fix

Convert each PDF page to an image using pdf2image, then send the image to an OCR API like OpenAI Whisper for transcription. This extracts accurate text from scanned pages.

This works because OCR models process images to recognize text, unlike PDF text extractors that expect embedded text.

python

import os
from pdf2image import convert_from_path
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Convert PDF pages to images
pages = convert_from_path("scanned_document.pdf", dpi=300)

full_text = ""
for i, page in enumerate(pages):
    image_path = f"page_{i}.png"
    page.save(image_path, "PNG")
    with open(image_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    full_text += transcript.text + "\n"
    os.remove(image_path)

print(full_text)

output

Extracted text from scanned PDF pages printed here...

Preventing it in production

Validate input PDFs to detect scanned vs. text PDFs before processing.
Use retries with exponential backoff for OCR API calls to handle transient errors.
Cache OCR results for repeated documents to reduce costs and latency.
Fallback to alternative OCR providers if one service is unavailable.

Related errors

Error	Cause	Quick fix
Empty text extraction	PDF contains scanned images, not text	Use OCR on page images
API RateLimitError	Too many OCR requests in short time	Add exponential backoff retry logic
Corrupted image file	Improper PDF to image conversion	Verify image files before OCR

Key Takeaways

Scanned PDFs require converting pages to images before OCR extraction.
Use OpenAI Whisper or similar OCR APIs for accurate text from images.
Implement retries and input validation to ensure robust OCR processing in production.

Verified 2026-04 · whisper-1, gpt-4o

Verify ↗

Community Notes

No notes yetBe the first to share a version-specific fix or tip.