How to handle scanned PDFs with OCR
Quick answer
To handle scanned PDFs, use an OCR tool or API such as Tesseract or Google Vision OCR to extract text from the page images embedded in the PDF. Convert the PDF pages to images first, then apply OCR to get machine-readable text for further AI processing.
Error type: code_error
Quick fix: Convert scanned PDF pages to images before sending them to an OCR engine such as Tesseract.
Why this happens
Scanned PDFs contain images of text rather than selectable text, so direct text extraction libraries such as pdfminer or PyPDF2 return empty or garbled text. Processing a scanned PDF without OCR therefore produces empty or incorrect output.
Typical broken code tries to extract text directly:
from PyPDF2 import PdfReader

reader = PdfReader("scanned_document.pdf")
text = ""
for page in reader.pages:
    # extract_text() only reads embedded text, which scanned pages lack
    text += page.extract_text() or ""
print(text)  # Empty or gibberish for scanned PDFs
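The empty result from direct extraction can also serve as a signal that a PDF needs OCR. The following is a minimal sketch of that heuristic, not a definitive detector: the function names `looks_scanned` and `pdf_looks_scanned`, the 20-character threshold, and the "more than half the pages" rule are all assumptions chosen for illustration.

```python
from typing import Iterable


def looks_scanned(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: True when most pages yield almost no extractable text.

    The threshold and the majority rule are illustrative assumptions.
    """
    texts = list(page_texts)
    if not texts:
        return False  # no pages, so there is nothing to OCR
    sparse = sum(1 for t in texts if len((t or "").strip()) < min_chars_per_page)
    return sparse / len(texts) > 0.5


def pdf_looks_scanned(path: str) -> bool:
    """Apply the heuristic to a PDF file using PyPDF2's text extraction."""
    # Imported lazily so the pure heuristic above has no third-party dependency.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return looks_scanned(page.extract_text() or "" for page in reader.pages)
```

Calling `pdf_looks_scanned("scanned_document.pdf")` before choosing a pipeline lets text-based PDFs skip OCR entirely. Mixed documents, where only some pages are scanned, may need the check applied per page instead.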
The fix
Convert each PDF page to an image using pdf2image, then run an OCR engine on each image. The example uses Tesseract through the pytesseract wrapper; a cloud OCR API such as Google Vision OCR fits the same loop. Speech-to-text models such as whisper-1 do not work here, because the audio transcription endpoint accepts audio files, not images.
This works because OCR models process images to recognize text, unlike PDF text extractors that expect embedded text.
import pytesseract
from pdf2image import convert_from_path

# Requires the Poppler utilities (for pdf2image) and the Tesseract binary (for pytesseract)
# Convert PDF pages to images
pages = convert_from_path("scanned_document.pdf", dpi=300)

page_texts = []
for page in pages:
    # Run OCR on the in-memory image; no temporary files are needed
    page_texts.append(pytesseract.image_to_string(page))

full_text = "\n".join(page_texts)
print(full_text)
Output:
Extracted text from scanned PDF pages printed here...
Preventing it in production
- Validate input PDFs to detect scanned vs. text PDFs before processing.
- Use retries with exponential backoff for OCR API calls to handle transient errors.
- Cache OCR results for repeated documents to reduce costs and latency.
- Fall back to alternative OCR providers if one service is unavailable.
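The retry, caching, and fallback points above can be combined in one small wrapper. This is a sketch under stated assumptions, not a production implementation: `OcrFn`, `call_with_backoff`, and `robust_ocr` are hypothetical names, each provider is assumed to be a callable that takes image bytes and returns text, and the cache is a plain in-memory dictionary.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Sequence

# Hypothetical provider signature: image bytes in, recognized text out.
OcrFn = Callable[[bytes], str]

# In-memory cache keyed by content hash; a real service might use a database.
_cache: Dict[str, str] = {}


def call_with_backoff(
    fn: OcrFn,
    data: bytes,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn, waiting base_delay * 2**attempt seconds between failed attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fn(data)
        except Exception:
            # Broad catch for illustration; a real integration should
            # retry only on transient errors such as timeouts.
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def robust_ocr(
    data: bytes,
    providers: Sequence[OcrFn],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return cached text if present, else try each provider in order."""
    key = hashlib.sha256(data).hexdigest()
    if key in _cache:
        return _cache[key]
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            text = call_with_backoff(provider, data, retries, base_delay, sleep)
        except Exception as exc:
            last_error = exc  # this provider is exhausted, try the next one
            continue
        _cache[key] = text
        return text
    raise RuntimeError("all OCR providers failed") from last_error
```

Hashing the image bytes means a document that is uploaded twice triggers OCR only once, and passing `sleep` as a parameter keeps the backoff logic easy to test without real delays.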
Key Takeaways
- Scanned PDFs require converting pages to images before OCR extraction.
- Use Tesseract, Google Vision OCR, or a similar OCR engine for accurate text from images.
- Implement retries and input validation to ensure robust OCR processing in production.