
Why scanned PDFs break loading

What you will learn
Scanned PDFs (image-only, no text layer) fail to load into RAG because PDF loaders extract text, not pixels, and a scan has no text to extract.
Step 2: Document Loading & Parsing (after selecting your document source, before chunking)

Why this matters

If you skip PDF type detection and attempt to load a scanned PDF with a standard text-extraction loader, you'll get empty documents, zero embeddings, and a retriever that has nothing to search. Your entire RAG pipeline will silently fail with no matches returned.

Explanation

The core issue: PDF loaders like PyPDFLoader or pdfplumber extract text from the PDF structure. A scanned PDF is essentially a photograph (pixels stored as page images) with no underlying text layer. When you load it, you get an empty string or whitespace, not the visible content.

Why this happens: When a PDF is created from a scan (e.g., a photograph of a document or a directly scanned page), the software stores image data, not searchable text. The PDF container exists, but the document loader finds no text to extract.

What to watch for: Before building your entire RAG system, always verify that your PDFs are text-based, not image-based. File size is only a rough hint: a scan stores an image for every page, so it is often larger than a text PDF with the same page count, but compression can reverse that. Seeing text when you open the PDF in a viewer does not mean the PDF has a text layer; you might just be seeing the rendered image.

Code

Illustrative only - requires a local sample.pdf to run
python
# pip install PyPDF2 pypdf langchain-community
from PyPDF2 import PdfReader
from langchain_community.document_loaders import PyPDFLoader

def check_pdf_type(pdf_path):
    """Detect if PDF has text layer or is scanned (image-only)."""
    reader = PdfReader(pdf_path)
    total_text = ''
    for page in reader.pages:
        # Guard against a missing (None) result so concatenation cannot fail
        total_text += page.extract_text() or ''
    
    text_length = len(total_text.strip())
    is_text_based = text_length > 50
    
    return {
        'file': pdf_path,
        'is_text_based': is_text_based,
        'extracted_chars': text_length,
        'status': 'OK - Load with PyPDFLoader' if is_text_based else 'SCANNED - Requires OCR'
    }

result = check_pdf_type('sample.pdf')
print(result)

if result['is_text_based']:
    loader = PyPDFLoader('sample.pdf')
    docs = loader.load()
    print(f"Loaded {len(docs)} pages")
    if docs:
        print(f"First 100 chars: {docs[0].page_content[:100]}")
else:
    print("ERROR: This PDF is scanned. Use OCR before loading.")
Output
{'file': 'sample.pdf', 'is_text_based': True, 'extracted_chars': 1245, 'status': 'OK - Load with PyPDFLoader'}
Loaded 3 pages
First 100 chars: Chapter 1: Introduction to RAG

Retrieval-Augmented Generation combines vector search with language models...

Your options

Filter out scanned PDFs and load only text-based PDFs

You have a mixed document source and only need to index text-based PDFs. Quick solution when OCR is not available or cost-prohibitive.

Pros

No additional dependencies. Fast. No OCR errors. Clear data quality.

Cons

Loses data from scanned documents. Requires upfront detection logic. Not scalable if document source changes.

# pip install PyPDF2
from PyPDF2 import PdfReader

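def has_text_layer(pdf_path):
    """Return True if the PDF yields more than 50 characters of text."""
    reader = PdfReader(pdf_path)
    # `or ''` guards against pages where extract_text() yields None
    text = ''.join(page.extract_text() or '' for page in reader.pages)
    return len(text.strip()) > 50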

print(has_text_layer('document.pdf'))  # True if text-based

Use a commercial document processing API (AWS Textract, Azure Form Recognizer, Google Document AI)

Production RAG system with high-volume scanned documents, strict accuracy requirements, or mixed document types at scale.

Pros

High OCR accuracy. Handles complex layouts, tables, forms. Structured output. Auto-detects document type.

Cons

Higher cost. Network latency. Dependency on external service. Requires API keys and authentication.

# pip install boto3
import boto3

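# Assumes AWS credentials and a region are already configured.
client = boto3.client('textract')
with open('scanned.pdf', 'rb') as f:
    # The synchronous call accepts single-page documents only; multi-page PDFs
    # need the asynchronous start_document_text_detection API with an S3 object.
    response = client.detect_document_text(Document={'Bytes': f.read()})
# Keep only LINE blocks, which carry one line of recognized text each.
text = '\n'.join(block['Text'] for block in response['Blocks'] if block['BlockType'] == 'LINE')
print(text[:200])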

Validation step

After loading a PDF, immediately print the first page's <code>page_content</code>. If it contains readable text (not empty string or garbled characters), the PDF has a text layer. If it's blank or contains only whitespace, you have a scanned PDF. A production check: if <code>len(extracted_text.strip()) < 50</code> after extraction, reject the PDF and log it for manual OCR processing.
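
The check described above can be sketched as a small helper that works on text already extracted from a page. The function name <code>classify_page_text</code>, the choice to apply the 50-character floor per page, and the 0.6 "readable ratio" used to flag garbled output are illustrative assumptions, not values taken from any library.

```python
def classify_page_text(text, min_chars=50, min_readable_ratio=0.6):
    """Classify extracted page text as 'empty', 'garbled', or 'ok'.

    Both thresholds are illustrative assumptions; tune them on your own documents.
    """
    stripped = (text or "").strip()
    if len(stripped) < min_chars:
        return "empty"
    # Count characters that are letters, digits, or whitespace.
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    if readable / len(stripped) < min_readable_ratio:
        return "garbled"
    return "ok"
```

A typical call would be <code>classify_page_text(docs[0].page_content)</code> right after loading; anything other than <code>'ok'</code> is a reason to inspect the file before embedding it.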

At scale

At scale (1000+ documents), undetected scanned PDFs silently poison your vector store. A scanned PDF whose pages come back as empty or whitespace text can still be embedded, depending on the embedding model and store; the resulting vector carries no meaning and reduces retriever quality. If 10% of your document source becomes scanned PDFs (e.g., due to a supplier format change), your RAG accuracy drops without triggering an alert. Always implement detection at ingestion time, not after building the store.
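
One way to apply detection at ingestion time is to split incoming documents into an embeddable set and a rejected set before anything reaches the embedding model. This is a minimal sketch: <code>partition_documents</code>, <code>empty_ratio</code>, and the <code>(source, text)</code> tuple shape are hypothetical, and the 50-character floor mirrors the threshold used elsewhere on this page.

```python
def partition_documents(docs, min_chars=50):
    """Split (source, text) pairs into embeddable pairs and rejected sources."""
    accepted, rejected = [], []
    for source, text in docs:
        if len((text or "").strip()) >= min_chars:
            accepted.append((source, text))
        else:
            # Too little text: likely scanned, so keep it out of the store.
            rejected.append(source)
    return accepted, rejected


def empty_ratio(accepted, rejected):
    """Fraction of documents rejected as empty; 0.0 when nothing was seen."""
    total = len(accepted) + len(rejected)
    return len(rejected) / total if total else 0.0
```

Tracking <code>empty_ratio</code> per ingestion run gives a simple signal for the supplier-format-change scenario: a sudden jump means the source has started sending scans.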

Rollback plan

If you discover scanned PDFs are already in your vector store: (1) Identify them using <code>check_pdf_type()</code> on all source files. (2) Delete the vector store entries for the scanned documents using <code>vectorstore.delete(ids=[...])</code>. (3) Re-process the scanned PDFs with OCR. (4) Re-embed and re-ingest. Do not attempt to update embeddings in place: delete and rebuild is safer.
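
Step (2) needs the ids of the affected entries. The sketch below assumes you can export <code>(chunk_id, metadata)</code> pairs from your store and that each metadata dict records the originating file under a <code>"source"</code> key; both the export shape and the helper name <code>ids_to_delete</code> are hypothetical.

```python
def ids_to_delete(records, scanned_sources):
    """Return ids of stored chunks whose source file was flagged as scanned.

    `records` is a hypothetical list of (chunk_id, metadata) pairs exported
    from the vector store; metadata["source"] is assumed to hold the file path.
    """
    scanned = set(scanned_sources)
    return [
        chunk_id
        for chunk_id, metadata in records
        if metadata.get("source") in scanned
    ]
```

The returned list is what you would pass as <code>ids</code> to the delete call in step (2).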

Debug symptoms

RAG returns 'No relevant documents found' for every query, even obvious matches

Diagnosis

PDFs were loaded successfully (no error), but document content is empty because PDFs are scanned

Fix

Run <code>check_pdf_type()</code> on all source PDFs. If any return <code>extracted_chars < 50</code>, apply OCR before loading.

PyPDFLoader runs without errors but <code>len(docs)</code> is high while <code>len(docs[0].page_content)</code> is 0 or whitespace

Diagnosis

PDF structure exists (pages detected) but no text extracted from pages

Fix

You have scanned PDFs. Use <code>pdfplumber</code> to confirm: <code>pdf.pages[0].extract_text()</code> returns an empty string (or <code>None</code>, depending on the version) for scanned pages. Then run OCR, for example with <code>pytesseract</code>, before loading.

Embeddings are created (no error) but vector store searches return random irrelevant results

Diagnosis

Empty documents were embedded as meaningless vectors, diluting your semantic search space

Fix

Clear vector store. Validate PDFs with <code>check_pdf_type()</code>. Re-process scanned PDFs with OCR. Re-build embeddings.

Production upgrade path

Beginner version: Use <code>PyPDFLoader</code> and check output manually. Production version: (1) Implement <code>validate_documents(pdf_path)</code> as a pre-ingestion gate. (2) Use AWS Textract or Google Document AI for at-scale OCR (often more accurate than Tesseract on complex layouts). (3) Store document metadata including <code>ocr_required: bool</code> and <code>extraction_confidence: float</code>. (4) Implement async OCR processing for scanned PDFs (don't block vector store ingestion). (5) Monitor and alert on <code>empty_document_count</code> during ingestion.
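
Items (3) and (5) could look roughly like the sketch below. The field names come from the list above, but how <code>extraction_confidence</code> is computed is an assumption made here (the share of pages with usable text), and <code>ExtractionRecord</code> and <code>build_record</code> are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ExtractionRecord:
    source: str
    ocr_required: bool
    extraction_confidence: float  # assumed: share of pages with usable text


def build_record(source, page_texts, min_chars_per_page=50):
    """Summarize extraction quality for one PDF from its per-page text."""
    usable = sum(len((t or "").strip()) >= min_chars_per_page for t in page_texts)
    confidence = usable / len(page_texts) if page_texts else 0.0
    # Any page without usable text means at least part of the file needs OCR.
    return ExtractionRecord(
        source, ocr_required=confidence < 1.0, extraction_confidence=confidence
    )


def empty_document_count(records):
    """Number of PDFs with no usable text on any page."""
    return sum(r.extraction_confidence == 0.0 for r in records)
```

Storing the record alongside each document lets an OCR worker pick up files with <code>ocr_required=True</code> later without blocking ingestion of the rest.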

Common gotcha

A PDF can open and display perfectly in your viewer while having zero text layer. File size is not a reliable indicator: a 100-page scan might be smaller than a 10-page text PDF depending on compression. The insidious part: your loader will silently return empty documents with no error, and your retriever will still index them, creating "ghost" vectors that look valid but match nothing.

Experienced dev note

Scanned PDF detection should be a mandatory pre-processing gate, not a troubleshooting step. In production RAG pipelines at scale (enterprise document processing), approximately 15-25% of real-world PDFs are scanned. Most teams discover this the hard way after building a vector store with 10K+ documents. The fix: always implement a validate_documents() function in your ingestion pipeline that rejects PDFs with extracted text below a threshold (e.g., 100 chars per page on average). Log rejected documents separately and queue them for manual OCR review. This prevents poisoning your vector store and gives you a clear signal when your document source changes.
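
A minimal sketch of such a gate, under stated assumptions: the 100-characters-per-page threshold is the example value from the note above, <code>validate_page_texts</code> is a hypothetical helper that keeps the rule testable without a PDF, and the wrapper uses <code>pypdf</code> (the maintained successor to PyPDF2) for extraction.

```python
def validate_page_texts(page_texts, min_avg_chars=100):
    """Accept a PDF only if its pages average at least `min_avg_chars` characters."""
    lengths = [len((t or "").strip()) for t in page_texts]
    avg = sum(lengths) / len(lengths) if lengths else 0.0
    return {
        "accepted": avg >= min_avg_chars,
        "avg_chars_per_page": avg,
        # Pages with no text at all: OCR candidates even if the PDF passes.
        "blank_pages": [i for i, n in enumerate(lengths) if n == 0],
    }


def validate_documents(pdf_path, min_avg_chars=100):
    """Extract per-page text with pypdf, then apply the gate above."""
    from pypdf import PdfReader  # lazy import: the pure check needs no dependency

    reader = PdfReader(pdf_path)
    return validate_page_texts(
        [page.extract_text() for page in reader.pages], min_avg_chars
    )
```

Reporting <code>blank_pages</code> separately is a design choice for mixed files: a PDF can pass the average while still containing a few scanned pages that deserve OCR.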

Check your understanding

If you have a PDF that loads without error, produces 10 documents (one per page), but the retriever returns nothing for any query, what is the most likely cause and how would you confirm it?

Answer hint

The most likely cause is a scanned PDF with no text layer. To confirm, extract text from the first page and check its length: if <code>len(docs[0].page_content.strip()) == 0</code>, the PDF is scanned. This is a common silent failure in RAG systems because the loader succeeds but returns empty content.
