High severity · Beginner · Fix: 2-5 min

UnicodeDecodeError

builtins.UnicodeDecodeError

What this error means

A LangChain document loader such as TextLoader fails to read a file because the encoding it decodes with does not match the file's actual encoding, so Python raises UnicodeDecodeError.

Stack trace

traceback

Traceback (most recent call last):
  File "app.py", line 42, in <module>
    documents = loader.load()
  File "/usr/local/lib/python3.9/site-packages/langchain/document_loaders/text.py", line 58, in load
    with open(self.file_path, encoding=self.encoding) as f:
  File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

QUICK FIX

Pass the encoding that matches the file (e.g., encoding='latin-1') to the loader's constructor.

Why it happens

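TextLoader passes its encoding argument to Python's open(). When no encoding is given, Python falls back to the platform default, which is UTF-8 on most Linux and macOS systems and often a legacy code page such as cp1252 on Windows. If the file was written in a different charset (e.g., Latin-1, Windows-1252, or UTF-16), or is not text at all, its bytes do not form valid sequences in the assumed encoding and Python raises UnicodeDecodeError.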

Detection

Watch logs for UnicodeDecodeError during document loading, and validate each file's encoding before it reaches the loader so that a single bad file does not crash the whole run.
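A pre-load check can be sketched with the standard library alone. The helper name first_invalid_utf8_offset is hypothetical and not part of LangChain. It only answers "is this file valid UTF-8?", which is enough to flag files that need an explicit encoding.

```python
from pathlib import Path


def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence,
    or None if the whole file decodes as UTF-8.

    Hypothetical helper for illustration; it reads the full file into memory.
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded
        return exc.start
    return None
```

A non-None result means the file should be loaded with an explicit, non-UTF-8 encoding or skipped.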

Causes & fixes

File is encoded in a non-UTF-8 encoding but loader tries to decode as UTF-8

✓ Fix

Specify the file's actual encoding explicitly when initializing the loader, e.g., encoding='latin-1' or encoding='windows-1252'.

Trying to load a binary or non-text file as text

✓ Fix

Use the text loader only for plain-text files. For other formats (PDF, Office documents, images), use a loader built for that format instead of decoding the raw bytes as text.

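File starts with a BOM (Byte Order Mark) that the chosen encoding does not expect

✓ Fix

Use encoding='utf-8-sig' to strip a UTF-8 BOM (bytes EF BB BF); plain 'utf-8' does not fail on it but leaves a stray U+FEFF character at the start of the text. If the error reports byte 0xff or 0xfe at position 0, as in the stack trace above, the file most likely starts with a UTF-16 BOM and needs encoding='utf-16' instead.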
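To tell which BOM a file carries, its first bytes can be compared against the constants in Python's codecs module. This is a minimal sketch; encoding_from_bom is a hypothetical helper name, and a file without a BOM simply yields None.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(path):
    """Return a codec name that consumes the file's BOM, or None if no BOM is found.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

The returned name can be passed as the encoding argument; the 'utf-16' and 'utf-32' codecs read the BOM to pick the byte order and then drop it from the decoded text.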

Code: broken vs fixed

Broken - triggers the error

python

from langchain_community.document_loaders import TextLoader

# No encoding given, so the platform default (usually UTF-8) is used
loader = TextLoader("data/document.txt")
documents = loader.load()  # Fails if the file is not valid in that encoding

Fixed - works correctly

python

from langchain_community.document_loaders import TextLoader

# Fixed: pass the encoding the file was actually written in
loader = TextLoader("data/document.txt", encoding="latin-1")
documents = loader.load()
print(f"Loaded {len(documents)} documents successfully.")

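The fix passes the file's encoding ('latin-1' here) to TextLoader so the bytes are decoded correctly. Note that 'latin-1' maps every possible byte to a character, so it never raises; if it is not the file's real encoding, the error disappears but the text may be garbled.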

⚠

Workaround

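Wrap the load call in try/except and retry with a fallback encoding such as 'utf-8-sig' or 'latin-1'. Depending on the installed version, TextLoader may re-raise the failure as a RuntimeError with the UnicodeDecodeError as its cause, so catch both. Try 'latin-1' last: it accepts any byte sequence, so nothing after it in the list would ever run.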
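The retry logic could look roughly like the sketch below. load_with_fallback and make_loader are hypothetical names: make_loader is any callable that takes an encoding and returns an object with a load() method. Catching RuntimeError as well as UnicodeDecodeError is an assumption about how the installed loader reports decode failures, and the encoding order is only one reasonable choice.

```python
def load_with_fallback(make_loader, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Try each encoding in order and return (documents, encoding_used).

    make_loader: callable taking an encoding name and returning an object
    with a .load() method (hypothetical interface for this sketch).
    """
    if not encodings:
        raise ValueError("encodings must not be empty")
    last_error = None
    for encoding in encodings:
        try:
            return make_loader(encoding).load(), encoding
        except (UnicodeDecodeError, RuntimeError) as exc:
            # Remember the failure and move on to the next candidate
            last_error = exc
    # Every candidate failed: surface the most recent error
    raise last_error
```

With TextLoader this might be called as load_with_fallback(lambda enc: TextLoader("data/document.txt", encoding=enc)). Returning the encoding that succeeded makes it possible to log when a fallback was needed.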

✓

Prevention

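Detect the file's encoding before loading with a library such as chardet or charset-normalizer, and always pass the encoding explicitly to the loader instead of relying on the platform default. TextLoader also accepts autodetect_encoding=True, which tries detected encodings when the initial decode fails.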
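As a rough illustration of detection with charset-normalizer, the sketch below uses its from_path function and returns the best guess. detect_encoding is a hypothetical helper name, the package must be installed separately, and the result is a statistical guess rather than a guarantee, so short or ambiguous files may be misidentified.

```python
def detect_encoding(path):
    """Return charset-normalizer's best guess for the file's encoding.

    Returns None if the package is not installed or no plausible encoding
    is found. Hypothetical helper for illustration only.
    """
    try:
        from charset_normalizer import from_path
    except ImportError:
        return None
    best = from_path(path).best()
    return best.encoding if best is not None else None
```

The returned name can be passed as the loader's encoding argument; treating None as "fall back to a manual choice" keeps the guess from being mistaken for a verified fact.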

Python 3.9+ · langchain-core >=0.1.0 · tested on 0.2.x

Verified 2026-04
