Handle scanned PDF extraction errors
Use pytesseract to convert images to text before extraction. Handle errors by detecting the PDF type and applying OCR accordingly. Preprocess scanned PDFs with pytesseract, or use an AI API that supports scanned PDF OCR, to avoid extraction errors.
Why this happens
Scanned PDFs are essentially images embedded in PDF containers without embedded text layers. Attempting to extract text directly using standard PDF parsers or AI document extraction APIs results in empty or garbled output because there is no selectable text.
Typical error output includes empty strings, null responses, or nonsensical characters when calling text extraction methods on scanned PDFs.
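The symptoms listed above can be checked for programmatically. The following is a minimal sketch, not a definitive test: the function name looks_like_failed_extraction and both thresholds (min_chars, min_readable_ratio) are hypothetical and should be tuned against your own documents.

```python
def looks_like_failed_extraction(text, min_chars=20, min_readable_ratio=0.6):
    """Heuristic: True if `text` resembles what a parser returns for a scanned PDF.

    The thresholds are illustrative assumptions, not values from any library.
    """
    # None or whitespace-only output means no usable text was extracted
    if text is None or not text.strip():
        return True
    stripped = text.strip()
    # A handful of characters for a whole document is suspicious
    if len(stripped) < min_chars:
        return True
    # Share of characters that are letters, digits, or whitespace
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < min_readable_ratio
```

A result of True only suggests that the output is unusable. Short but valid documents, or text heavy in symbols, can trigger it as well.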
Example broken code snippet:
from PyPDF2 import PdfReader
reader = PdfReader("scanned_document.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text()  # Returns None or empty string for scanned PDFs
print(text)
The fix
Use OCR to convert scanned PDF images into text before extraction. pdf2image renders each PDF page to an image, and pytesseract runs OCR on that image to produce text. pdf2image requires Poppler and pytesseract requires the Tesseract binary to be installed on the system.
This approach works because OCR recognizes characters in images, so text can be recovered even when the PDF has no text layer. Accuracy depends on scan quality and resolution.
from pdf2image import convert_from_path
import pytesseract
# Convert scanned PDF pages to images
pages = convert_from_path("scanned_document.pdf", dpi=300)
# Extract text from each image page using OCR
text = ""
for page in pages:
    text += pytesseract.image_to_string(page)
print(text)  # Extracted text from scanned PDF pages printed here...
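The OCR loop above stops at the first page that raises an error. A hedged variant that logs the failing page and keeps going might look like the sketch below. The helper name ocr_pages and its ocr_func parameter are hypothetical; ocr_func exists only so the OCR call can be swapped out, and it defaults to pytesseract.image_to_string.

```python
import logging

logger = logging.getLogger(__name__)


def ocr_pages(pages, ocr_func=None):
    """OCR each page image, skipping pages that fail instead of aborting."""
    if ocr_func is None:
        # Imported lazily so the helper can be exercised without Tesseract installed
        import pytesseract
        ocr_func = pytesseract.image_to_string
    chunks = []
    for number, page in enumerate(pages, start=1):
        try:
            chunks.append(ocr_func(page))
        except Exception:
            # Record which page failed; keep an empty placeholder so the
            # remaining pages are still processed
            logger.exception("OCR failed on page %d", number)
            chunks.append("")
    # Join with newlines so text from adjacent pages does not run together
    return "\n".join(chunks)
```

With pdf2image available, it would be called as ocr_pages(convert_from_path("scanned_document.pdf", dpi=300)).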
Preventing it in production
Implement detection logic to distinguish scanned PDFs from text-based PDFs. For scanned PDFs, apply OCR preprocessing automatically.
Use AI document extraction APIs that support OCR natively or integrate OCR libraries in your pipeline.
Incorporate retry and fallback mechanisms to handle extraction failures gracefully and log errors for monitoring.
from pdf2image import convert_from_path
from PyPDF2 import PdfReader
import pytesseract

def is_scanned_pdf(pdf_path: str) -> bool:
    """Return True if no page yields extractable text."""
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        # extract_text() may return None or only whitespace for image-only pages
        if (page.extract_text() or "").strip():
            return False
    return True

pdf_path = "document.pdf"
if is_scanned_pdf(pdf_path):
    # Use OCR extraction
    pages = convert_from_path(pdf_path, dpi=300)
    text = "".join(pytesseract.image_to_string(page) for page in pages)
else:
    # Use direct text extraction
    reader = PdfReader(pdf_path)
    text = "".join(page.extract_text() or "" for page in reader.pages)
print(text)  # Extracted text from PDF, using OCR if scanned, else direct extraction.
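The fallback and logging advice can be sketched as a small helper that tries extractors in order and moves on when one raises or returns nothing. This is an illustration under assumptions: extract_with_fallback and the (name, function) pair convention are hypothetical, and the sketch covers fallback and logging only, not retries.

```python
import logging

logger = logging.getLogger(__name__)


def extract_with_fallback(pdf_path, extractors):
    """Try each (name, func) extractor in order; return the first non-empty text."""
    for name, func in extractors:
        try:
            text = func(pdf_path)
        except Exception:
            # Log the failure with its traceback and move on to the next extractor
            logger.exception("%s extraction raised for %s", name, pdf_path)
            continue
        if text and text.strip():
            return text
        # Empty output is treated as a failure so the next extractor gets a turn
        logger.warning("%s extraction returned no text for %s", name, pdf_path)
    raise ValueError(f"no extractor produced text for {pdf_path}")
```

In a pipeline, the list would typically hold a direct text extractor first and an OCR-based extractor second, so the slower OCR path runs only when needed.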
Key Takeaways
- Scanned PDFs require OCR preprocessing before text extraction to avoid empty or invalid results.
- Use pdf2image and pytesseract to convert scanned PDF pages to text reliably.
- Detect scanned PDFs programmatically to apply the correct extraction method automatically.