Why scanned PDFs break loading
Why this matters
If you skip PDF type detection and load a scanned PDF with a standard text-extraction loader, you get empty documents, embeddings that carry no meaning, and a retriever with nothing useful to search. The pipeline fails silently: no exception is raised, but queries return no relevant matches.
Explanation
The core issue: tools like PyPDFLoader and pdfplumber extract text from the PDF's text layer. A scanned PDF is essentially a photograph, pixels stored as page images, with no underlying text layer. When you load it, you get an empty string or whitespace, not the visible content.
Why this happens: When a PDF is created from a scan (e.g., a photograph of a document or a directly scanned page), the software stores image data, not searchable text. The PDF container exists, but the document loader finds no text to extract.
What to watch for: Before building your RAG system, always verify that your PDFs are text-based, not image-based. File size is only a weak hint: scanned pages are stored as images, so a scan is often larger per page than a text PDF, but compression can reverse that. Opening the PDF in a viewer and seeing text does not mean the PDF has a text layer; you might just be seeing the rendered image.
Code
# pip install PyPDF2 pypdf langchain-community
from PyPDF2 import PdfReader
from langchain_community.document_loaders import PyPDFLoader


def check_pdf_type(pdf_path):
    """Detect if PDF has text layer or is scanned (image-only)."""
    reader = PdfReader(pdf_path)
    total_text = ''
    for page in reader.pages:
        # Guard against None so image-only pages count as empty text
        total_text += page.extract_text() or ''
    text_length = len(total_text.strip())
    # Heuristic: 50 characters or fewer means no usable text layer
    is_text_based = text_length > 50
    return {
        'file': pdf_path,
        'is_text_based': is_text_based,
        'extracted_chars': text_length,
        'status': 'OK - Load with PyPDFLoader' if is_text_based else 'SCANNED - Requires OCR'
    }


result = check_pdf_type('sample.pdf')
print(result)
if result['is_text_based']:
    loader = PyPDFLoader('sample.pdf')
    docs = loader.load()
    print(f"Loaded {len(docs)} pages")
    if docs:
        print(f"First 100 chars: {docs[0].page_content[:100]}")
else:
    print("ERROR: This PDF is scanned. Use OCR before loading.")
Output
{'file': 'sample.pdf', 'is_text_based': True, 'extracted_chars': 1245, 'status': 'OK - Load with PyPDFLoader'}
Loaded 3 pages
First 100 chars: Chapter 1: Introduction to RAG
Retrieval-Augmented Generation combines vector search with language models...
Your options
Use OCR (Optical Character Recognition) to extract text from scanned PDFs
You have scanned PDFs and need to include them in your RAG system. This is the correct long-term solution for production pipelines with mixed document types.
Pros
Extracts searchable text from image-only PDFs. Works for any scanned document. Handles mixed PDF types (some text-based, some scanned).
Cons
Slower than text extraction. OCR errors reduce embedding quality. Requires additional dependency (Tesseract or cloud service). Cost per page if using cloud OCR.
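# pip install pdf2image pytesseract
# Also requires the Poppler utilities (for pdf2image) and the Tesseract binary (for pytesseract)
import pytesseract
from pdf2image import convert_from_path

# Render each PDF page to an image, then run OCR on each image
image_list = convert_from_path('scanned.pdf')
ocr_text = '\n'.join(pytesseract.image_to_string(img) for img in image_list)
print(ocr_text[:200])
Filter out scanned PDFs and load only text-based PDFs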
You have a mixed document source and only need to index text-based PDFs. Quick solution when OCR is not available or cost-prohibitive.
Pros
No additional dependencies. Fast. No OCR errors. Clear data quality.
Cons
Loses data from scanned documents. Requires upfront detection logic. Not scalable if document source changes.
# pip install PyPDF2
from PyPDF2 import PdfReader


def has_text_layer(pdf_path):
    """Return True if the PDF yields more than 50 characters of extractable text."""
    reader = PdfReader(pdf_path)
    # Treat a None result as empty text so image-only pages do not raise
    text = ''.join(page.extract_text() or '' for page in reader.pages)
    return len(text.strip()) > 50


print(has_text_layer('document.pdf'))  # True if text-based
Use a commercial document processing API (AWS Textract, Azure Form Recognizer, Google Document AI)
Production RAG system with high-volume scanned documents, strict accuracy requirements, or mixed document types at scale.
Pros
High OCR accuracy. Handles complex layouts, tables, forms. Structured output. Auto-detects document type.
Cons
Higher cost. Network latency. Dependency on external service. Requires API keys and authentication.
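# pip install boto3
import boto3

# Assumes AWS credentials and a default region are already configured
client = boto3.client('textract')
with open('scanned.pdf', 'rb') as f:
    # The synchronous call handles single-page documents; multi-page PDFs
    # need the asynchronous start_document_text_detection API with the file in S3
    response = client.detect_document_text(Document={'Bytes': f.read()})
# Keep only LINE blocks, which hold one line of recognized text each
text = '\n'.join(block['Text'] for block in response['Blocks'] if block['BlockType'] == 'LINE')
print(text[:200])
Validation step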
After loading a PDF, immediately print the first page's <code>page_content</code>. If it contains readable text (not empty string or garbled characters), the PDF has a text layer. If it's blank or contains only whitespace, you have a scanned PDF. A production check: if <code>len(extracted_text.strip()) < 50</code> after extraction, reject the PDF and log it for manual OCR processing.
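The check described above can be sketched as a small helper. This is a minimal sketch, assuming the extracted pages are available as plain strings; the name `looks_scanned` is hypothetical, and the `min_chars` default of 50 simply mirrors the threshold mentioned above.

```python
def looks_scanned(page_texts, min_chars=50):
    """Return True when the combined extracted text is too short to be a real text layer."""
    # Treat None like an empty page, since some extractors return None
    combined = "".join(text or "" for text in page_texts)
    return len(combined.strip()) < min_chars


# With LangChain documents you could pass [d.page_content for d in docs]
print(looks_scanned(["", "   \n", None]))                    # True
print(looks_scanned(["Chapter 1: Introduction to RAG. " * 5]))  # False
```

A PDF flagged this way would be rejected and logged for OCR rather than passed on to the splitter and embedder.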
At scale
At scale (1,000+ documents), undetected scanned PDFs silently poison your vector store. Depending on your embedding model and loader, an empty or whitespace-only chunk may still be embedded; the resulting vector carries no meaning and degrades retrieval quality. If 10% of your document source becomes scanned PDFs (e.g., due to a supplier format change), your RAG accuracy drops without triggering an alert. Always implement detection at ingestion time, not after building the store.
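One way to turn this into an alert is to count empty documents while ingesting. The sketch below is illustrative only: the class name `IngestionStats`, the 50-character cutoff, and the 5% alert ratio are assumptions, not values from any library.

```python
from dataclasses import dataclass, field


@dataclass
class IngestionStats:
    total: int = 0
    empty_document_count: int = 0
    rejected_files: list = field(default_factory=list)

    def record(self, path, extracted_chars, min_chars=50):
        """Count one file; return True if it has enough text to ingest."""
        self.total += 1
        if extracted_chars < min_chars:
            self.empty_document_count += 1
            self.rejected_files.append(path)
            return False
        return True

    def empty_ratio(self):
        return self.empty_document_count / self.total if self.total else 0.0

    def should_alert(self, max_empty_ratio=0.05):
        # Hypothetical policy: alert when more than 5% of files look scanned
        return self.empty_ratio() > max_empty_ratio


stats = IngestionStats()
for path, chars in [("a.pdf", 1245), ("b.pdf", 0), ("c.pdf", 3)]:
    stats.record(path, chars)
print(stats.empty_document_count, stats.should_alert())  # 2 True
```

In a real pipeline, `extracted_chars` would come from the detection step, and `should_alert()` would feed whatever monitoring you already use.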
Rollback plan
If you discover scanned PDFs are already in your vector store: (1) Identify them by running <code>check_pdf_type()</code> on all source files. (2) Delete the vector store entries that came from scanned documents using <code>vectorstore.delete(ids=[...])</code>. (3) Re-process the scanned PDFs with OCR. (4) Re-embed and re-ingest. Do not attempt to update embeddings in place: delete and rebuild is safer.
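Steps (1) and (2) can be sketched as follows. This assumes you recorded, at ingestion time, which source file each stored chunk ID came from; the `id_to_source` mapping, the helper names, and the `InMemoryStore` stand-in are hypothetical and only mimic the `delete(ids=...)` call mentioned above.

```python
def ids_for_sources(id_to_source, scanned_sources):
    """Collect the chunk IDs that originated from scanned source files."""
    scanned = set(scanned_sources)
    return [doc_id for doc_id, source in id_to_source.items() if source in scanned]


def purge_scanned(vectorstore, id_to_source, scanned_sources):
    """Delete chunks from scanned files and return the IDs that were removed."""
    ids = ids_for_sources(id_to_source, scanned_sources)
    if ids:
        vectorstore.delete(ids=ids)
    return ids


class InMemoryStore:
    """Stand-in for a real vector store; it only mimics delete(ids=...)."""

    def __init__(self, ids):
        self.ids = set(ids)

    def delete(self, ids):
        self.ids -= set(ids)


store = InMemoryStore(["a1", "a2", "b1"])
mapping = {"a1": "scan.pdf", "a2": "scan.pdf", "b1": "text.pdf"}
print(purge_scanned(store, mapping, ["scan.pdf"]))  # ['a1', 'a2']
print(store.ids)                                    # {'b1'}
```

The returned ID list doubles as an audit trail of what was removed before the OCR re-ingest.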
Debug symptoms
RAG returns 'No relevant documents found' for every query, even obvious matches
Diagnosis
PDFs were loaded successfully (no error), but document content is empty because PDFs are scanned
Fix
Run <code>check_pdf_type()</code> on all source PDFs. If any return <code>extracted_chars < 50</code>, apply OCR before loading.
PyPDFLoader runs without errors and <code>len(docs)</code> matches the page count, but <code>len(docs[0].page_content)</code> is 0 or the content is only whitespace
Diagnosis
PDF structure exists (pages detected) but no text extracted from pages
Fix
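You have scanned PDFs. To confirm with <code>pdfplumber</code>, call <code>pdf.pages[0].extract_text()</code>: for scanned pages it returns <code>None</code> or an empty string, depending on the version. Then run OCR (for example with <code>pytesseract</code>) before loading.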
Embeddings are created (no error) but vector store searches return random irrelevant results
Diagnosis
Empty documents were embedded anyway, filling your semantic search space with vectors that carry no meaning
Fix
Clear vector store. Validate PDFs with <code>check_pdf_type()</code>. Re-process scanned PDFs with OCR. Re-build embeddings.
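To tell a fully scanned PDF from one with only a few image pages, it can help to list which loaded pages are blank. A minimal sketch, assuming each loaded document exposes a `page_content` attribute as LangChain documents do; the helper name `empty_page_indices` is hypothetical, and `SimpleNamespace` stands in for real documents.

```python
from types import SimpleNamespace


def empty_page_indices(docs):
    """Return the indices of documents whose page_content is empty or whitespace."""
    return [
        i for i, doc in enumerate(docs)
        if not (getattr(doc, "page_content", "") or "").strip()
    ]


# Stand-ins for loaded pages: page 0 and page 2 have no extractable text
docs = [
    SimpleNamespace(page_content=""),
    SimpleNamespace(page_content="Chapter 1: Introduction to RAG"),
    SimpleNamespace(page_content=" \n"),
]
print(empty_page_indices(docs))  # [0, 2]
```

If every index comes back, the whole file needs OCR; if only some do, the file mixes text pages and scanned pages.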
Production upgrade path
Beginner version: Use <code>PyPDFLoader</code> and check the output manually. Production version: (1) Implement <code>validate_documents(pdf_path)</code> as a pre-ingestion gate. (2) Use AWS Textract or Google Document AI for at-scale OCR (often more accurate than Tesseract on complex layouts). (3) Store document metadata including <code>ocr_required: bool</code> and <code>extraction_confidence: float</code>. (4) Implement async OCR processing for scanned PDFs (don't block vector store ingestion). (5) Monitor and alert on <code>empty_document_count</code> during ingestion.
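Items (1) and (3) might look like the sketch below. The text above does not define `extraction_confidence`, so the scoring here (the share of pages with enough text) is purely an assumption, as are the extra `page_texts` parameter, the 100-character page cutoff, and the 0.5 confidence cutoff.

```python
from dataclasses import dataclass


@dataclass
class ExtractionMetadata:
    source: str
    ocr_required: bool
    extraction_confidence: float  # assumed meaning: share of pages with enough text


def validate_documents(pdf_path, page_texts, min_chars_per_page=100, min_confidence=0.5):
    """Pre-ingestion gate: score a PDF from its extracted page texts."""
    lengths = [len((text or "").strip()) for text in page_texts]
    if lengths:
        confidence = sum(n >= min_chars_per_page for n in lengths) / len(lengths)
    else:
        confidence = 0.0  # no pages extracted at all
    return ExtractionMetadata(
        source=pdf_path,
        ocr_required=confidence < min_confidence,
        extraction_confidence=confidence,
    )


print(validate_documents("report.pdf", ["x" * 200, "", "y" * 150, "z" * 120]))
print(validate_documents("scan.pdf", ["", "  "]))
```

Storing the result alongside each chunk lets you filter or re-process documents later without re-reading the source files.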
Common gotcha
A PDF can open and display perfectly in your viewer while having zero text layer. File size is not a reliable indicator: a 100-page scan might be smaller than a 10-page text PDF depending on compression. The insidious part: your loader will silently return empty documents with no error, and your vector store may still index them, creating "ghost" vectors that look valid but match nothing.
Experienced dev note
Scanned PDF detection should be a mandatory pre-processing gate, not a troubleshooting step. In production RAG pipelines at scale (enterprise document processing), approximately 15-25% of real-world PDFs are scanned. Most teams discover this the hard way after building a vector store with 10K+ documents. The fix: always implement a validate_documents() function in your ingestion pipeline that rejects PDFs with extracted text below a threshold (e.g., 100 chars per page on average). Log rejected documents separately and queue them for manual OCR review. This prevents poisoning your vector store and gives you a clear signal when your document source changes.
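The gate described in this note could be sketched roughly as below. It assumes the page texts have already been extracted per file; the function name `gate_batch`, the logger name, and the in-memory review queue are hypothetical, and the 100 chars-per-page threshold is the example value from the note.

```python
import logging

logger = logging.getLogger("ingestion")


def gate_batch(batch, min_avg_chars_per_page=100):
    """Split a batch into accepted files and files queued for OCR review.

    batch maps a file path to the list of texts extracted from its pages.
    """
    accepted, ocr_review_queue = [], []
    for path, page_texts in batch.items():
        pages = len(page_texts)
        total_chars = sum(len((text or "").strip()) for text in page_texts)
        avg = total_chars / pages if pages else 0.0
        if avg >= min_avg_chars_per_page:
            accepted.append(path)
        else:
            # Log rejections separately so a change in the document source is visible
            logger.warning("Rejected %s: %.1f chars/page on average", path, avg)
            ocr_review_queue.append(path)
    return accepted, ocr_review_queue


accepted, queue = gate_batch({
    "text.pdf": ["a" * 300, "b" * 50],  # 175 chars/page on average
    "scan.pdf": ["", ""],               # nothing extracted
})
print(accepted, queue)  # ['text.pdf'] ['scan.pdf']
```

Averaging per page rather than using a fixed total keeps the threshold meaningful for both short and long documents.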
Check your understanding
If you have a PDF that loads without error, produces 10 documents (one per page), but the retriever returns nothing for any query, what is the most likely cause and how would you confirm it?
Show answer hint
Extract text from the first page and check its length. If <code>len(docs[0].page_content.strip()) == 0</code>, the PDF is most likely scanned; check a few more pages to rule out a blank or image-only cover page. This is the most common silent failure in RAG systems because the loader succeeds but returns empty content.