Debug Fix intermediate · 3 min read

How to handle scanned PDFs with OCR

Quick answer
To handle scanned PDFs, use an OCR tool or API like OpenAI Whisper or Google Vision OCR to extract text from images embedded in PDFs. Convert PDF pages to images first, then apply OCR to get machine-readable text for further AI processing.
ERROR TYPE code_error
⚡ QUICK FIX
Convert scanned PDF pages to images before sending them to an OCR API like whisper-1 for transcription.

Why this happens

Scanned PDFs contain images of text rather than selectable text, so direct text extraction methods like pdfminer or PyPDF2 return no or garbled text. Attempting to process scanned PDFs without OCR leads to empty or incorrect outputs.

Typical broken code tries to extract text directly:

python
from PyPDF2 import PdfReader

reader = PdfReader("scanned_document.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() or ""
print(text)  # Outputs empty or gibberish for scanned PDFs
output
 

The fix

Convert each PDF page to an image using pdf2image, then send the image to an OCR API like OpenAI Whisper for transcription. This extracts accurate text from scanned pages.

This works because OCR models process images to recognize text, unlike PDF text extractors that expect embedded text.

python
import os
from pdf2image import convert_from_path
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Convert PDF pages to images
pages = convert_from_path("scanned_document.pdf", dpi=300)

full_text = ""
for i, page in enumerate(pages):
    image_path = f"page_{i}.png"
    page.save(image_path, "PNG")
    with open(image_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    full_text += transcript.text + "\n"
    os.remove(image_path)

print(full_text)
output
Extracted text from scanned PDF pages printed here...

Preventing it in production

  • Validate input PDFs to detect scanned vs. text PDFs before processing.
  • Use retries with exponential backoff for OCR API calls to handle transient errors.
  • Cache OCR results for repeated documents to reduce costs and latency.
  • Fallback to alternative OCR providers if one service is unavailable.

Key Takeaways

  • Scanned PDFs require converting pages to images before OCR extraction.
  • Use OpenAI Whisper or similar OCR APIs for accurate text from images.
  • Implement retries and input validation to ensure robust OCR processing in production.
Verified 2026-04 · whisper-1, gpt-4o
Verify ↗