High severity · Intermediate · Fix: 5-10 min

UnicodeDecodeError

builtins.UnicodeDecodeError (UTF-8 encoding mismatch in document text extraction)

What this error means

LangChain's document loaders fail with UnicodeDecodeError when a file contains non-UTF-8 encoded text (Latin-1, Windows-1252, Big5, etc.), or when a PDF extractor encounters corrupted/mixed encoding.

Stack trace

traceback

Traceback (most recent call last):
  File "/path/to/site-packages/langchain_community/document_loaders/text.py", line 45, in load
    with open(file_path, 'r', encoding='utf-8') as f:
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 142: invalid continuation byte

QUICK FIX

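Detect the file's encoding with chardet, then pass it to TextLoader: encoding = chardet.detect(open(file_path, 'rb').read())['encoding']; TextLoader(file_path, encoding=encoding). Alternatively, TextLoader(file_path, autodetect_encoding=True) asks the loader to try detected encodings itself when the first decode fails.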

Why it happens

LangChain document loaders typically end up decoding text files as UTF-8, either because the loader requests it or because UTF-8 is the platform default, but many real-world documents use legacy encodings such as Latin-1 (ISO-8859-1), Windows-1252, or Big5 (Traditional Chinese). PDFs may also contain mixed encodings or corrupted text streams. When the loader tries to decode these bytes as UTF-8 and hits a byte sequence that is not valid UTF-8, Python raises UnicodeDecodeError. This is especially common when loading documents from older systems, non-English sources, or untrusted data sources.
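
The failure can be reproduced without LangChain. The sketch below encodes a short string as Latin-1 and then decodes it as UTF-8; the sample string is an arbitrary illustration, not taken from a real document.

```python
# "é" is the single byte 0xE9 in Latin-1. In UTF-8, 0xE9 announces a
# three-byte sequence, so the byte after it must be a continuation byte.
raw = "café au lait".encode("latin-1")

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # The exception carries the codec, the offending position, and the reason.
    failure = (exc.encoding, hex(raw[exc.start]), exc.start, exc.reason)

print(failure)

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("latin-1")
print(text)
```

The same attributes (`start`, `reason`, `object`) are available on the exception raised inside a loader, which makes them useful for logging.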

Detection

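Wrap document loader calls in try/except and log the file path together with the encoding detection result (use the chardet library to identify the likely encoding). Depending on the langchain-community version, TextLoader may re-raise the failure as a RuntimeError chained to the original UnicodeDecodeError, so also check the exception's __cause__ rather than catching UnicodeDecodeError alone. Monitor file loads in production and alert on encoding errors to catch problematic documents early.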
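
A minimal sketch of such a wrapper follows. `find_decode_error`, `load_with_encoding_log`, and the `load_fn` callable are hypothetical names, not LangChain APIs; you would pass something like `lambda: TextLoader(path).load()` as `load_fn`. The wrapper walks the exception chain so it also recognizes a UnicodeDecodeError that was re-raised inside another exception.

```python
import logging

logger = logging.getLogger("ingest.encoding")


def find_decode_error(exc):
    """Return the UnicodeDecodeError in an exception chain, or None."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, UnicodeDecodeError):
            return exc
        seen.add(id(exc))
        exc = exc.__cause__ or exc.__context__
    return None


def load_with_encoding_log(load_fn, file_path):
    """Call load_fn(); log decode details, then re-raise any failure."""
    try:
        return load_fn()
    except Exception as exc:
        decode_error = find_decode_error(exc)
        if decode_error is not None:
            logger.error(
                "Encoding error in %s: byte 0x%02x at position %d (%s)",
                file_path,
                decode_error.object[decode_error.start],
                decode_error.start,
                decode_error.reason,
            )
        raise
```

The chardet result for the file could be added to the same log record; it is left out here to keep the sketch dependency-free.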

Causes & fixes

TextLoader defaults to UTF-8 but file uses Latin-1 (ISO-8859-1) or Windows-1252

✓ Fix

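Pass the encoding parameter explicitly: TextLoader(file_path, encoding='latin-1'), or detect the encoding with chardet.detect() before loading. Note that 'latin-1' maps every byte to a character, so it never raises but can silently produce wrong text if the file actually uses a different encoding.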

PDF contains scanned images with OCR text in non-UTF-8 encoding, or corrupted text stream

✓ Fix

Use PyPDFLoader with error handling, or switch to pdfplumber with fallback to image-based OCR (pytesseract/GPT-4o vision) for scanned PDFs

Multi-encoding document (some pages UTF-8, others Latin-1) causing failure mid-parse

✓ Fix

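Read the file yourself with errors='replace' or errors='ignore': open(path, encoding='utf-8', errors='replace'), and build the documents from the resulting text, or implement per-page/per-chunk encoding detection. Both error modes lose information: 'replace' substitutes U+FFFD for each undecodable byte and 'ignore' drops it silently.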
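
Per-chunk detection can be approximated with the standard library alone. The sketch below treats each line as a chunk, tries UTF-8 first, and falls back to Windows-1252 and then Latin-1. The function name and the fallback order are assumptions to adapt to your data, and the approach only suits ASCII-compatible encodings (not UTF-16, for example).

```python
def decode_mixed(raw, fallbacks=("cp1252", "latin-1")):
    """Decode bytes line by line, preferring UTF-8 for each line."""
    decoded_lines = []
    for line in raw.splitlines(keepends=True):
        try:
            decoded_lines.append(line.decode("utf-8"))
            continue
        except UnicodeDecodeError:
            pass
        for encoding in fallbacks:
            try:
                decoded_lines.append(line.decode(encoding))
                break
            except UnicodeDecodeError:
                continue
        else:
            # No candidate worked; keep the line but mark the bad bytes.
            decoded_lines.append(line.decode("utf-8", errors="replace"))
    return "".join(decoded_lines)


# One UTF-8 line followed by one Latin-1 line in the same byte stream.
mixed = "naïve\n".encode("utf-8") + "café\n".encode("latin-1")
print(decode_mixed(mixed))
```

Trying UTF-8 first is deliberate: legacy single-byte text with accented characters rarely happens to be valid UTF-8, so a successful UTF-8 decode is a strong signal.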

CSV or JSON loader trying to read file with wrong declared encoding in file metadata

✓ Fix

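Pass the file's real encoding to the loader: CSVLoader(file_path, encoding='latin-1'), or detect the encoding with chardet first and pass the result. For loaders that do not expose an encoding argument, re-encode the file to UTF-8 before loading.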

Code: broken vs fixed

Broken - triggers the error

python

from langchain_community.document_loaders import TextLoader

file_path = 'documents/report.txt'  # Contains Latin-1 encoded text
loader = TextLoader(file_path)  # ❌ No encoding given; decoding as UTF-8 fails on Latin-1 bytes
docs = loader.load()
print(f'Loaded {len(docs)} documents')

Fixed - works correctly

python

import chardet
from langchain_community.document_loaders import TextLoader

file_path = 'documents/report.txt'

# ✅ Detect the actual encoding from the raw bytes before loading
with open(file_path, 'rb') as f:
    raw_bytes = f.read()

detected = chardet.detect(raw_bytes)  # dict with 'encoding' and 'confidence' keys
# 'encoding' is None when chardet cannot decide; fall back to UTF-8
encoding = detected['encoding'] or 'utf-8'

loader = TextLoader(file_path, encoding=encoding)  # ✅ Use detected encoding
docs = loader.load()
print(f'Loaded {len(docs)} documents with encoding: {encoding}')

Added chardet to detect the file's actual encoding before passing it to TextLoader, preventing UnicodeDecodeError by using the correct encoding parameter instead of assuming UTF-8.

⚠ Workaround

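Wrap the loader in try/except and fall back to lossy decoding: try the load, catch the error, then open the file with encoding='utf-8' and errors='replace'. This does not strip undecodable bytes; it substitutes U+FFFD (the replacement character) for each one so decoding can continue. Then wrap the text in a Document yourself, for example Document(page_content=text, metadata={'source': path}), or parse it with json.loads() if the file is JSON.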
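
The fallback path can be sketched as below. `read_text_lossy` is a hypothetical helper, not a LangChain function; it returns the text plus a count of replacement characters so you can decide whether the result is still usable.

```python
from pathlib import Path


def read_text_lossy(path, encoding="utf-8"):
    """Read a text file, substituting U+FFFD for undecodable bytes.

    Returns (text, replaced). replaced is 0 when the file decoded cleanly;
    otherwise it is the number of U+FFFD characters in the lossy result.
    """
    raw = Path(path).read_bytes()
    try:
        return raw.decode(encoding), 0
    except UnicodeDecodeError:
        text = raw.decode(encoding, errors="replace")
        return text, text.count("\ufffd")
```

A high replacement count relative to the file size usually means the file is in a different encoding altogether, in which case detecting the encoding is a better option than lossy decoding.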

✓ Prevention

Standardize the document ingestion pipeline: (1) detect the encoding of every incoming file with chardet, (2) expose a per-loader encoding parameter in your LangChain wrapper class, (3) for PDFs, use pdfplumber or unstructured.io, which handle encoding internally, (4) sanitize and validate text encoding at ingestion time, before anything is stored in the vector DB.
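
Step (4) can be sketched as a normalization pass that rewrites each incoming file as UTF-8 before any loader sees it. The function name, the candidate list, and its order are assumptions; in a real pipeline the candidates could come from chardet's result.

```python
from pathlib import Path


def normalize_to_utf8(src, dst, candidates=("utf-8", "cp1252")):
    """Decode src with the first candidate that works and write dst as UTF-8.

    Returns the encoding that was used. Raises ValueError if none fits, so
    the caller can quarantine the file instead of storing garbled text.
    """
    raw = Path(src).read_bytes()
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        Path(dst).write_text(text, encoding="utf-8")
        return encoding
    raise ValueError(f"{src}: none of {candidates} could decode the file")
```

Once every stored file is UTF-8, downstream loaders can keep their defaults, and the returned encoding can be recorded as document metadata for auditing.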

Python 3.9+ · langchain-community >=0.0.1 · tested on 0.1.x

Verified 2026-04
