High severity intermediate · Fix: 10-20 min

garbled_text_output

pypdf.PdfReader.extract_text(): encoding/character output error

What this error means

pypdf's extract_text() returns mojibake (garbled, unreadable characters) instead of clean text, usually due to PDF font encoding mismatch, missing character maps, or incorrect text extraction strategy.

Stack trace

traceback

No exception raised — text extracts successfully but contains garbled output:

>>> from pypdf import PdfReader
>>> reader = PdfReader('document.pdf')
>>> text = reader.pages[0].extract_text()
>>> print(text)
'\x00H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d'  # or similar mojibake
# OR
'àáâãäå ÂçÈ¶£' # random Unicode replacement characters
# OR
'???? ???? ????'  # replacement character U+FFFD where text should be

QUICK FIX

Try pdfplumber.open(pdf_path).pages[0].extract_text() first: it handles encoding better than pypdf for most real-world PDFs, or fall back to pytesseract for scanned documents.

Why it happens

PDFs embed text in multiple ways: some store actual Unicode text, others use font-specific character mappings (like custom fonts or CID fonts). pypdf's extract_text() tries to decode these mappings automatically, but fails when: (1) the PDF uses a proprietary font without embedded character maps, (2) text is stored as CID (Composite Identity) without a ToUnicode mapping, or (3) the PDF was created with broken encoding metadata. pypdf then falls back to byte sequences or placeholder characters instead of readable text.

Detection

After extraction, check the output for replacement character U+FFFD ('?'), null bytes ('\x00'), or non-ASCII sequences that don't match your document's language. Add text validation before processing: `if '\ufffd' in text or '\x00' in text: log_warning('Possible encoding issue detected')`.

Causes & fixes

PDF uses custom/embedded fonts without ToUnicode character mapping

✓ Fix

Switch to pdfplumber or pymupdf (fitz) which have better font encoding handling, or use OCR fallback via pytesseract/Tesseract when pypdf output fails validation

Text stored as CID (Composite Identity) strings instead of Unicode

✓ Fix

Enable pypdf's advanced extraction: use `extract_text(extraction_mode='layout')` or try `extract_pages()` with manual layout analysis to recover word order

PDF metadata lists wrong encoding or character set

✓ Fix

Try pypdf's `visitor_text` parameter or switch to pdfplumber which re-analyzes character boxes independently of metadata

Scanned PDF (image-based): pypdf cannot extract text from images

✓ Fix

Use Tesseract OCR (`pytesseract`) or AWS Textract, or convert PDF to images and pass to GPT-4o vision API for text recognition

Code: broken vs fixed

Broken - triggers the error

python

import os
from pypdf import PdfReader

pdf_path = 'document.pdf'
reader = PdfReader(pdf_path)

# This may return garbled text for PDFs with custom fonts
text = reader.pages[0].extract_text()  # <-- Line that produces garbled output
print(text)
print(repr(text))  # Shows mojibake: '\x00H\x00e\x00l\x00l\x00o' or 'àáâã'

Fixed - works correctly

python

import os
from pypdf import PdfReader
import pdfplumber

pdf_path = 'document.pdf'

# Strategy 1: Try pdfplumber first (better encoding handling)
try:
    with pdfplumber.open(pdf_path) as pdf:
        text = pdf.pages[0].extract_text()
        # Validate extraction quality
        if '\ufffd' in text or len(text.strip()) < 5:
            raise ValueError('Low quality extraction — fallback to OCR')
        print('Extracted (pdfplumber):', text)
except Exception as e:
    print(f'pdfplumber failed: {e} — trying pypdf with layout mode')
    
    # Strategy 2: pypdf with layout extraction
    reader = PdfReader(pdf_path)
    text = reader.pages[0].extract_text(extraction_mode='layout')
    if '\ufffd' not in text and len(text.strip()) > 5:
        print('Extracted (pypdf layout):', text)
    else:
        # Strategy 3: Fall back to OCR for scanned/complex PDFs
        print('Resorting to OCR for this PDF...')
        try:
            import pytesseract
            from pdf2image import convert_from_path
            images = convert_from_path(pdf_path, first_page=1, last_page=1)
            text = pytesseract.image_to_string(images[0])
            print('Extracted (OCR):', text)
        except ImportError:
            print('Install pytesseract and pdf2image for OCR: pip install pytesseract pdf2image')

Added pdfplumber as primary strategy (superior encoding/font handling), pypdf layout mode as secondary, and pytesseract OCR as final fallback for scanned PDFs — with text quality validation to detect garbling automatically.

⚠

Workaround

Extract text from PDF as images and use GPT-4o vision API: `for page in pdf.pages: image = page.to_image(); response = openai.chat.completions.create(model='gpt-4o', messages=[{'role': 'user', 'content': [{'type': 'image_url', 'image_url': {'url': image_base64}}]}]); text = response.choices[0].message.content`: bypasses character encoding entirely.

✓

Prevention

At document ingestion, validate extracted text quality before downstream processing: check for excessive replacement characters (U+FFFD), null bytes, or entropy > expected language baseline. Log failures with original PDF filename and implement a three-tier extraction strategy: (1) pdfplumber for structured PDFs, (2) pypdf layout mode for custom fonts, (3) OCR for scanned. Use Unstructured.io library which auto-selects the best extraction strategy per document type.

Python 3.9+ · pypdf >=3.0.0 · tested on 4.2.x

Verified 2026-04 · gpt-4o

Verify ↗

Community Notes

No notes yetBe the first to share a version-specific fix or tip.