garbled_text_output
pypdf.PdfReader.extract_text(): encoding/character output error
Stack trace
No exception raised — text extracts successfully but contains garbled output:
>>> from pypdf import PdfReader
>>> reader = PdfReader('document.pdf')
>>> text = reader.pages[0].extract_text()
>>> print(text)
'\x00H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d' # or similar mojibake
# OR
'àáâãäå Âçȶ£' # random Unicode replacement characters
# OR
'???? ???? ????' # replacement character U+FFFD where text should be Why it happens
PDFs embed text in multiple ways: some store actual Unicode text, others use font-specific character mappings (like custom fonts or CID fonts). pypdf's extract_text() tries to decode these mappings automatically, but fails when: (1) the PDF uses a proprietary font without embedded character maps, (2) text is stored as CID (Composite Identity) without a ToUnicode mapping, or (3) the PDF was created with broken encoding metadata. pypdf then falls back to byte sequences or placeholder characters instead of readable text.
Detection
After extraction, check the output for replacement character U+FFFD ('?'), null bytes ('\x00'), or non-ASCII sequences that don't match your document's language. Add text validation before processing: `if '\ufffd' in text or '\x00' in text: log_warning('Possible encoding issue detected')`.
Causes & fixes
PDF uses custom/embedded fonts without ToUnicode character mapping
Switch to pdfplumber or pymupdf (fitz) which have better font encoding handling, or use OCR fallback via pytesseract/Tesseract when pypdf output fails validation
Text stored as CID (Composite Identity) strings instead of Unicode
Enable pypdf's advanced extraction: use `extract_text(extraction_mode='layout')` or try `extract_pages()` with manual layout analysis to recover word order
PDF metadata lists wrong encoding or character set
Try pypdf's `visitor_text` parameter or switch to pdfplumber which re-analyzes character boxes independently of metadata
Scanned PDF (image-based): pypdf cannot extract text from images
Use Tesseract OCR (`pytesseract`) or AWS Textract, or convert PDF to images and pass to GPT-4o vision API for text recognition
Code: broken vs fixed
import os
from pypdf import PdfReader
pdf_path = 'document.pdf'
reader = PdfReader(pdf_path)
# This may return garbled text for PDFs with custom fonts
text = reader.pages[0].extract_text() # <-- Line that produces garbled output
print(text)
print(repr(text)) # Shows mojibake: '\x00H\x00e\x00l\x00l\x00o' or 'àáâã' import os
from pypdf import PdfReader
import pdfplumber
pdf_path = 'document.pdf'
# Strategy 1: Try pdfplumber first (better encoding handling)
try:
with pdfplumber.open(pdf_path) as pdf:
text = pdf.pages[0].extract_text()
# Validate extraction quality
if '\ufffd' in text or len(text.strip()) < 5:
raise ValueError('Low quality extraction — fallback to OCR')
print('Extracted (pdfplumber):', text)
except Exception as e:
print(f'pdfplumber failed: {e} — trying pypdf with layout mode')
# Strategy 2: pypdf with layout extraction
reader = PdfReader(pdf_path)
text = reader.pages[0].extract_text(extraction_mode='layout')
if '\ufffd' not in text and len(text.strip()) > 5:
print('Extracted (pypdf layout):', text)
else:
# Strategy 3: Fall back to OCR for scanned/complex PDFs
print('Resorting to OCR for this PDF...')
try:
import pytesseract
from pdf2image import convert_from_path
images = convert_from_path(pdf_path, first_page=1, last_page=1)
text = pytesseract.image_to_string(images[0])
print('Extracted (OCR):', text)
except ImportError:
print('Install pytesseract and pdf2image for OCR: pip install pytesseract pdf2image') Workaround
Extract text from PDF as images and use GPT-4o vision API: `for page in pdf.pages: image = page.to_image(); response = openai.chat.completions.create(model='gpt-4o', messages=[{'role': 'user', 'content': [{'type': 'image_url', 'image_url': {'url': image_base64}}]}]); text = response.choices[0].message.content`: bypasses character encoding entirely.
Prevention
At document ingestion, validate extracted text quality before downstream processing: check for excessive replacement characters (U+FFFD), null bytes, or entropy > expected language baseline. Log failures with original PDF filename and implement a three-tier extraction strategy: (1) pdfplumber for structured PDFs, (2) pypdf layout mode for custom fonts, (3) OCR for scanned. Use Unstructured.io library which auto-selects the best extraction strategy per document type.