UnicodeDecodeError
builtins.UnicodeDecodeError (UTF-8 encoding mismatch in document text extraction)
Stack trace
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 142: invalid continuation byte
  File "/path/to/site-packages/langchain_community/document_loaders/text.py", line 45, in load
    with open(file_path, 'r', encoding='utf-8') as f:
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 142: invalid continuation byte
Why it happens
LangChain document loaders usually decode text files as UTF-8 unless told otherwise, but many real-world documents use legacy encodings such as Latin-1 (ISO-8859-1), Windows-1252, or Big5 (Traditional Chinese). Text extracted from PDFs can also contain mixed encodings or corrupted text streams. When the loader tries to decode these bytes as UTF-8, Python raises UnicodeDecodeError. In the trace above, byte 0xe9 is "é" in Latin-1 and Windows-1252; in UTF-8 it announces a three-byte sequence, so decoding fails as soon as the next byte is not a continuation byte. This is especially common with documents from older systems, non-English sources, or untrusted data sources.
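The mismatch can be reproduced without LangChain. The following is a minimal sketch using only the standard library; the sample string is an illustrative assumption and is not taken from the failing file.

```python
# "é" (U+00E9) is the single byte 0xE9 in Latin-1 but the two bytes 0xC3 0xA9 in UTF-8.
text = "caf\u00e9 au lait"
latin1_bytes = text.encode("latin-1")  # b'caf\xe9 au lait'
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9 au lait'

try:
    latin1_bytes.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xE9 opens a three-byte sequence, but the next byte
    # (0x20, a space) is not a continuation byte, so decoding fails.
    error = exc

print(error)

# Decoding with the encoding the bytes were written in succeeds.
decoded = latin1_bytes.decode("latin-1")
```

The same bytes decode cleanly once the correct encoding is named, which is why every fix in this entry comes down to finding out or declaring the real encoding.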
Detection
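Wrap document loader calls in try/except UnicodeDecodeError and log the file path together with the detected encoding (the chardet library can estimate the actual encoding from the raw bytes). Depending on the langchain_community version, TextLoader may re-raise the failure as a RuntimeError whose __cause__ is the UnicodeDecodeError, so it is worth catching both. Monitor file loads in production and alert on encoding errors to catch problematic documents early.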
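A minimal sketch of that logging step, shown on a plain file read so it runs without LangChain. The function names and the logger name are hypothetical, and chardet is treated as optional: if it is not installed, the guess is simply logged as None.

```python
import logging
from pathlib import Path

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None

logger = logging.getLogger("ingest.encoding")  # hypothetical logger name


def guess_encoding(raw: bytes):
    """Return chardet's best guess, or None if chardet is unavailable or unsure."""
    if chardet is None:
        return None
    return chardet.detect(raw)["encoding"]


def read_utf8_or_report(path):
    """Decode a file as UTF-8; on failure, log diagnostics and re-raise."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        logger.warning(
            "utf-8 decode failed: file=%s byte=0x%02x offset=%d guess=%s",
            path, raw[exc.start], exc.start, guess_encoding(raw),
        )
        raise
```

Re-raising keeps the failure visible to the caller, while the log line records the offending byte and offset needed to triage the file later.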
Causes & fixes
Cause: TextLoader decodes as UTF-8 but the file uses Latin-1 (ISO-8859-1) or Windows-1252
Fix: Pass the encoding explicitly: TextLoader(file_path, encoding='latin-1'), let the loader guess with TextLoader(file_path, autodetect_encoding=True), or detect the encoding with chardet.detect() before loading
Cause: PDF contains scanned images with OCR text in a non-UTF-8 encoding, or a corrupted text stream
Fix: Use PyPDFLoader with error handling, or switch to pdfplumber with a fallback to image-based OCR (pytesseract/GPT-4o vision) for scanned PDFs
Cause: Multi-encoding document (some pages UTF-8, others Latin-1) causing a failure mid-parse
Fix: Open the file with errors='replace' or errors='ignore', for example open(path, encoding='utf-8', errors='replace'), accepting that undecodable bytes are replaced with U+FFFD or dropped; or implement per-page/per-chunk encoding detection to keep the original characters
Cause: CSV or JSON loader reading a file whose actual encoding differs from the one declared or assumed
Fix: Pass the real encoding to the loader, for example CSVLoader(file_path, encoding='latin-1'), or detect the encoding with chardet before passing the file to the loader
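For the mixed-encoding case, per-chunk detection can be approximated by decoding line by line with a fallback chain. This is a rough sketch, not a LangChain API: decode_mixed and its candidate list are hypothetical, and it assumes the file is a byte-oriented text format in which 0x0A is always a line break (true for UTF-8, Latin-1, and Windows-1252, but not for UTF-16).

```python
def decode_mixed(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Decode each line with the first candidate encoding that accepts it."""
    decoded_lines = []
    for line in raw.splitlines(keepends=True):
        for encoding in candidates:
            try:
                decoded_lines.append(line.decode(encoding))
                break
            except UnicodeDecodeError:
                continue
        else:
            # No candidate accepted the line: keep it, marking bad bytes with U+FFFD.
            decoded_lines.append(line.decode("utf-8", errors="replace"))
    return "".join(decoded_lines)


# One UTF-8 line followed by one Latin-1 line.
sample = "na\u00efve\n".encode("utf-8") + "caf\u00e9\n".encode("latin-1")
result = decode_mixed(sample)
```

UTF-8 comes first in the candidate list because it is strict: text in a legacy encoding rarely forms valid UTF-8 by accident, whereas single-byte encodings accept almost any input.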
Code: broken vs fixed
Broken:
from langchain_community.document_loaders import TextLoader

file_path = 'documents/report.txt'  # Contains Latin-1 encoded text
loader = TextLoader(file_path)  # ❌ No encoding given; decoded as UTF-8, crashes on Latin-1
docs = loader.load()
print(f'Loaded {len(docs)} documents')

Fixed:
import chardet
from langchain_community.document_loaders import TextLoader

file_path = 'documents/report.txt'

# ✅ Detect the actual encoding from the raw bytes before loading
with open(file_path, 'rb') as f:
    raw_bytes = f.read()
detected = chardet.detect(raw_bytes)
# chardet reports None when it cannot decide; fall back to UTF-8 in that case
encoding = detected['encoding'] or 'utf-8'

loader = TextLoader(file_path, encoding=encoding)  # ✅ Use detected encoding
docs = loader.load()
print(f'Loaded {len(docs)} documents with encoding: {encoding}')
Workaround
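Wrap the loader in try/except UnicodeDecodeError and fall back to errors='replace' mode: try the load, catch the error, then reopen the file with encoding='utf-8' and errors='replace'. This does not strip undecodable bytes; it substitutes the replacement character U+FFFD for each one, so the text loads but the original characters are lost. Parse the sanitized text manually, for example with json.loads() for JSON files or with your own parse_document() helper.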
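A minimal sketch of this fallback on a plain file read. The helper name read_text_lenient and the returned lossy flag are assumptions made for illustration; wrapping the resulting text into LangChain documents is left out.

```python
import json


def read_text_lenient(path):
    """Return (text, lossy). lossy is True when bad bytes were replaced with U+FFFD."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read(), False
    except UnicodeDecodeError:
        # Second attempt: never raises, but each undecodable byte becomes U+FFFD.
        with open(path, encoding="utf-8", errors="replace") as f:
            return f.read(), True


def load_json_lenient(path):
    """Parse a JSON file even if its string values contain non-UTF-8 bytes."""
    text, lossy = read_text_lenient(path)
    return json.loads(text), lossy
```

Returning the lossy flag lets the caller tag or quarantine documents that contain replacement characters instead of silently indexing damaged text.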
Prevention
Standardize the document ingestion pipeline: (1) detect the encoding of every incoming file with chardet, (2) expose a per-loader encoding parameter in your LangChain wrapper class, (3) for PDFs, use pdfplumber or unstructured.io, which handle encoding internally, (4) sanitize and validate text encoding at ingestion time, before storing anything in the vector DB.
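Steps (1) and (4) can be sketched as a small normalization function that runs before anything reaches the vector DB. This is an illustration under stated assumptions: the function names are hypothetical, and an ordered candidate list stands in for chardet so the sketch needs only the standard library.

```python
def decode_with_candidates(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used), trying candidate encodings in order."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the input")


def looks_clean(text: str) -> bool:
    """Reject text that still carries signs of a bad decode."""
    if "\ufffd" in text:
        return False
    # C1 control characters (U+0080 to U+009F) rarely occur in real text and
    # usually mean the bytes were decoded with the wrong single-byte encoding.
    return not any("\x80" <= ch <= "\x9f" for ch in text)


def normalize_for_ingestion(raw: bytes):
    """Return (utf8_bytes, source_encoding), or raise ValueError if suspicious."""
    text, encoding = decode_with_candidates(raw)
    if not looks_clean(text):
        raise ValueError(f"text decoded as {encoding} failed validation")
    return text.encode("utf-8"), encoding
```

Storing only validated UTF-8, together with the encoding it came from, means downstream loaders never have to guess again and rejected files can be reviewed by hand.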