UnicodeDecodeError
builtins.UnicodeDecodeError
Stack trace
Traceback (most recent call last):
  File "app.py", line 42, in <module>
    documents = loader.load()
  File "/usr/local/lib/python3.9/site-packages/langchain/document_loaders/text.py", line 58, in load
    with open(self.file_path, encoding=self.encoding) as f:
  File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Why it happens
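DocumentLoader reads files as text. Unless an encoding is specified, the file is decoded with Python's default text encoding, which is UTF-8 in the stack trace above. If the file uses a different encoding (e.g., Latin-1, Windows-1252, or UTF-16) or is not text at all, Python raises UnicodeDecodeError at the first byte sequence that is not valid UTF-8. Byte 0xff is never valid in UTF-8, so a file that begins with it, as in the trace, is typically UTF-16 with a byte order mark (0xFF 0xFE) or a binary format.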
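The failure can be reproduced without any loader. The sketch below is illustrative only: the sample bytes are invented, chosen to begin with the UTF-16 little-endian byte order mark so that they produce the same message as the stack trace.

```python
# Invented sample: "hi" encoded as UTF-16 little-endian, preceded by its BOM (0xFF 0xFE).
data = b"\xff\xfe" + "hi".encode("utf-16-le")

try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    message = str(err)
    print(message)

# The utf-16 codec reads the BOM, picks the byte order, and strips the BOM from the result.
text = data.decode("utf-16")
print(text)
```

The same bytes decode cleanly once the codec matches the file, which is why the fixes on this page all come down to choosing the right encoding.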
Detection
Monitor logs for UnicodeDecodeError raised during document loading. To catch the problem before it crashes a run, validate each file's encoding before handing it to the loader.
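One way to validate a file before loading is to attempt a strict decode and report where it fails. This is a minimal sketch, not part of any loader API: the function name find_decode_error is hypothetical, and it reads the whole file into memory, which may not suit very large files.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_decode_error(path: str, encoding: str = "utf-8") -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first undecodable byte, or None if the file decodes."""
    raw = Path(path).read_bytes()
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte the codec rejected.
        return err.start, raw[err.start]
    return None
```

A result such as (0, 255) corresponds to "byte 0xff in position 0" in the stack trace and can be logged or used to skip the file.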
Causes & fixes
File is encoded in a non-UTF-8 encoding but loader tries to decode as UTF-8
Specify the correct encoding explicitly when initializing the DocumentLoader, e.g., encoding='latin-1' or 'windows-1252'.
Trying to load a binary or non-text file as text
Ensure the DocumentLoader is used only with text files or use a binary-safe loader for non-text documents.
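File starts with a BOM (byte order mark) or contains characters the default encoding cannot decode
Match the encoding to the BOM. Use encoding='utf-16' for a UTF-16 BOM (bytes 0xFF 0xFE or 0xFE 0xFF), which is the likely case when the error names byte 0xff at position 0. Use encoding='utf-8-sig' for a UTF-8 BOM; plain UTF-8 decodes that BOM without an error but leaves a stray U+FEFF character at the start of the text.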
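The three causes above can often be told apart from the first few bytes of the file. The sketch below uses only the standard library; the helper names codec_from_bom and looks_binary are hypothetical, and the NUL-byte check is a rough heuristic, not a reliable binary detector.

```python
import codecs
from typing import Optional

# Longer BOMs come first: the UTF-32 LE BOM begins with the same bytes as the UTF-16 LE BOM.
# A UTF-16 LE file whose first character is NUL would be misread as UTF-32; that is rare.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def codec_from_bom(head: bytes) -> Optional[str]:
    """Return a codec name implied by a leading BOM, or None if there is no BOM."""
    for bom, codec in _BOMS:
        if head.startswith(bom):
            return codec
    return None


def looks_binary(head: bytes) -> bool:
    """Heuristic: no BOM plus NUL bytes usually means the file is not plain text."""
    return codec_from_bom(head) is None and b"\x00" in head
```

Reading the first kilobyte or so in binary mode (open(path, "rb")) is enough input for both helpers. When codec_from_bom returns a name, that name can be passed as the loader's encoding; when looks_binary returns True, the file probably needs a loader for its actual format.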
Code: broken vs fixed
# Broken
from langchain.document_loaders import TextLoader

loader = TextLoader("data/document.txt")
documents = loader.load()  # Raises UnicodeDecodeError if the file is not in the expected encoding

# Fixed
from langchain.document_loaders import TextLoader

# Pass the file's actual encoding; "latin-1" is an example, not a universal fix
loader = TextLoader("data/document.txt", encoding="latin-1")
documents = loader.load()
print(f"Loaded {len(documents)} documents successfully.")
Workaround
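Wrap the load call in try/except UnicodeDecodeError and retry with a fallback encoding such as 'cp1252' or 'latin-1'. Put 'latin-1' last: it maps every byte to a character, so it never fails, but it silently yields wrong characters when the file is actually in another encoding. 'utf-8-sig' only skips a leading UTF-8 BOM and rejects the same bytes as 'utf-8', so it does not recover from this error.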
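The retry loop might look like the sketch below. It assumes the loader surfaces UnicodeDecodeError directly, as in the stack trace on this page; the function name load_with_fallback, the make_loader factory, and the default encoding order are all illustrative choices, not part of LangChain.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def load_with_fallback(
    make_loader: Callable[[str, str], Any],
    path: str,
    encodings: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[List[Any], str]:
    """Try each encoding in order; return (documents, encoding that worked)."""
    if not encodings:
        raise ValueError("encodings must not be empty")
    last_error: Optional[UnicodeDecodeError] = None
    for encoding in encodings:
        try:
            # make_loader builds a fresh loader for this encoding (hypothetical factory).
            return make_loader(path, encoding).load(), encoding
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next candidate
    raise last_error
```

With the loader used on this page, the factory could be lambda p, enc: TextLoader(p, encoding=enc). Returning the encoding that succeeded makes it possible to log which files needed a fallback, since a 'latin-1' result may hide mojibake.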
Prevention
Detect file encoding before loading using libraries like chardet or charset-normalizer and always specify encoding explicitly in DocumentLoader to avoid decode errors.