
Handle scanned PDF extraction errors

Quick answer

Scanned PDFs contain page images rather than a text layer, so direct text extraction returns empty or garbled output. Run OCR first: either use an OCR-enabled AI API, or convert the pages to images and pass them to an OCR library such as pytesseract. In a pipeline, detect whether a PDF has a text layer and apply OCR only when it does not.

ERROR TYPE model_behavior

⚡ QUICK FIX

Add OCR preprocessing with pytesseract or use an AI API that supports scanned PDF OCR to avoid extraction errors.

Why this happens

Scanned PDFs are page images wrapped in a PDF container, usually with no embedded text layer. Standard PDF parsers and AI document extraction APIs that read only the text layer therefore return empty or garbled output, because there is no selectable text to read.

Typical error output includes empty strings, null responses, or nonsensical characters when calling text extraction methods on scanned PDFs.
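The symptoms above (empty strings, null responses, nonsensical characters) can be checked for in code before the result is passed downstream. The following is a minimal sketch, not a definitive detector: the function name `looks_like_failed_extraction` and both thresholds (`min_chars`, `min_readable_ratio`) are assumptions chosen for illustration and should be tuned on your own documents.

```python
from typing import Optional


def looks_like_failed_extraction(
    text: Optional[str],
    min_chars: int = 20,              # assumed threshold, tune per corpus
    min_readable_ratio: float = 0.6,  # assumed threshold, tune per corpus
) -> bool:
    """Heuristically flag output that is missing, too short, or mostly noise."""
    if text is None:
        return True
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    # Share of characters that are letters, digits, or whitespace
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < min_readable_ratio


print(looks_like_failed_extraction(None))    # → True
print(looks_like_failed_extraction("  \n"))  # → True
print(looks_like_failed_extraction(
    "Invoice 1042: total due 315.00 USD by 30 June."
))                                           # → False
```

A length threshold will also flag pages that are legitimately short, such as a cover page, so a positive result is best treated as a signal to try OCR rather than as proof that the file is scanned.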

Example broken code snippet:

python

from PyPDF2 import PdfReader

reader = PdfReader("scanned_document.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text()  # Returns None or empty string for scanned PDFs
print(text)

output

The fix

Use OCR to convert scanned PDF images into text before extraction. You can preprocess scanned PDFs with pytesseract and pdf2image to convert PDF pages to images and then extract text via OCR.

This approach works because OCR recognizes characters in the rendered page images. Accuracy depends on scan quality and resolution, which is why the example renders pages at 300 DPI.

python

# Requires the Tesseract and Poppler binaries to be installed on the system
from pdf2image import convert_from_path
import pytesseract

# Convert scanned PDF pages to images
pages = convert_from_path("scanned_document.pdf", dpi=300)

# Extract text from each image page using OCR
text = ""
for page in pages:
    text += pytesseract.image_to_string(page)

print(text)

output

Extracted text from scanned PDF pages printed here...

Preventing it in production

Implement detection logic to distinguish scanned PDFs from text-based PDFs. For scanned PDFs, apply OCR preprocessing automatically.

Use AI document extraction APIs that support OCR natively or integrate OCR libraries in your pipeline.

Incorporate retry and fallback mechanisms to handle extraction failures gracefully and log errors for monitoring.

python

from pdf2image import convert_from_path
from PyPDF2 import PdfReader
import pytesseract


def is_scanned_pdf(pdf_path: str) -> bool:
    """Return True if no page yields any extractable text."""
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        # Treat None and whitespace-only results as "no text layer"
        if (page.extract_text() or "").strip():
            return False
    return True


pdf_path = "document.pdf"
if is_scanned_pdf(pdf_path):
    # No text layer found: render pages to images and run OCR
    pages = convert_from_path(pdf_path, dpi=300)
    text = "".join(pytesseract.image_to_string(page) for page in pages)
else:
    # Text layer present: extract it directly
    reader = PdfReader(pdf_path)
    text = "".join(page.extract_text() or "" for page in reader.pages)

print(text)

output

Extracted text from PDF, using OCR if scanned, else direct extraction.
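The retry, fallback, and logging advice above can be sketched as a small wrapper. This is one possible shape under stated assumptions, not a prescribed implementation: `extract_with_fallback`, the `direct_extract` and `ocr_extract` callables, and the `ocr_attempts` parameter are hypothetical names introduced here. The extractors are passed in as functions so the wrapper does not depend on any particular PDF or OCR library.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("pdf_extraction")  # logger name is an assumption

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    pdf_path: str,
    direct_extract: Extractor,
    ocr_extract: Extractor,
    ocr_attempts: int = 2,  # assumed default
) -> str:
    """Try direct extraction, then fall back to OCR with bounded retries."""
    try:
        text = direct_extract(pdf_path)
        if text and text.strip():
            return text
        logger.info("No text layer in %s; falling back to OCR", pdf_path)
    except Exception:
        logger.exception("Direct extraction failed for %s", pdf_path)

    for attempt in range(1, ocr_attempts + 1):
        try:
            text = ocr_extract(pdf_path)
            if text and text.strip():
                return text
            logger.warning("OCR attempt %d returned no text for %s",
                           attempt, pdf_path)
        except Exception:
            logger.exception("OCR attempt %d failed for %s", attempt, pdf_path)

    # Every method failed: log for monitoring and return an empty result
    logger.error("All extraction methods failed for %s", pdf_path)
    return ""
```

In practice `direct_extract` could wrap the PdfReader loop and `ocr_extract` the pdf2image and pytesseract steps shown earlier. Retrying helps mainly with transient failures, such as a remote OCR API timing out; a local OCR run on the same file will generally give the same result each time, so `ocr_attempts=1` may be enough there.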

Related errors

Error	Cause	Quick fix
Empty extraction result	PDF is a scanned image with no text layer	Use OCR preprocessing with `pytesseract`
Garbled characters	Incorrect encoding or corrupted PDF	Validate PDF encoding or repair PDF before extraction
API returns null text	Model not configured for OCR or scanned docs	Use AI API with OCR support or preprocess with OCR locally


Key Takeaways

Scanned PDFs require OCR preprocessing before text extraction to avoid empty or invalid results.
Use pdf2image to render scanned PDF pages as images and pytesseract to convert those images to text.
Detect scanned PDFs programmatically to apply the correct extraction method automatically.

Verified 2026-04
