ValueError
langchain.text_splitter.TokenTextSplitter.ValueError
Stack trace
ValueError: Encoding 'cl100k_base' not found. Please check the encoding name or install the required tokenizer package.
File ".../langchain/text_splitter.py", line 123, in __init__
self.tokenizer = tiktoken.get_encoding(encoding_name)
File ".../tiktoken/__init__.py", line 45, in get_encoding
raise ValueError(f"Encoding '{name}' not found.")
Why it happens
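TokenTextSplitter relies on the tiktoken library to tokenize text according to a named encoding. If the encoding name is misspelled or unsupported, or the installed tiktoken version is too old to include it, the lookup fails and this ValueError is raised. In practice the cause is almost always an incorrect encoding_name argument or an environment whose tiktoken install lacks the requested encoding. Note that a completely missing tiktoken package usually surfaces as an ImportError rather than this ValueError.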
Detection
Catch ValueError when constructing TokenTextSplitter and log the encoding name that was passed, so a misspelled or unsupported name can be identified from the logs instead of from an unexplained crash.
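As a minimal sketch of this detection step, the helper below wraps splitter construction, logs the rejected encoding name, and re-raises. The names build_token_splitter, _default_factory, and the logger name "splitter_setup" are hypothetical and introduced only for illustration; the factory parameter is an assumption that lets the wrapper be exercised without langchain installed.

```python
import logging
from typing import Any, Callable

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("splitter_setup")


def _default_factory(encoding_name: str) -> Any:
    # Imported lazily so this module still loads where langchain is absent.
    from langchain.text_splitter import TokenTextSplitter

    return TokenTextSplitter(encoding_name=encoding_name)


def build_token_splitter(
    encoding_name: str,
    factory: Callable[[str], Any] = _default_factory,
) -> Any:
    """Build a splitter, logging the encoding name if construction fails."""
    try:
        return factory(encoding_name)
    except ValueError:
        # Record the exact name that was rejected, then re-raise so the
        # caller still sees the original failure and traceback.
        logger.exception("Token splitter rejected encoding_name=%r", encoding_name)
        raise
```

Calling build_token_splitter("cl100k_base") uses the real TokenTextSplitter; passing a different factory is only a convenience for testing the logging path.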
Causes & fixes
The encoding name passed to TokenTextSplitter is misspelled or invalid.
Verify and correct the encoding name string to a valid encoding supported by tiktoken, such as 'cl100k_base' or 'r50k_base'.
The installed tiktoken package is outdated and does not include the requested encoding (a package that is not installed at all usually fails earlier with an ImportError).
Install or upgrade tiktoken with 'pip install -U tiktoken' so that current encodings are available.
Using a custom or unsupported encoding name not included in the tiktoken library.
Switch to a standard encoding supported by tiktoken or implement a custom tokenizer compatible with TokenTextSplitter.
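One way to catch a misspelled name before it reaches the splitter is to compare it with the names the installed tiktoken reports. The sketch below is an illustration under assumptions: suggest_encodings and check_encoding_name are hypothetical helpers, and the list of known names is passed in by the caller (for example from tiktoken.list_encoding_names()) rather than looked up inside the function.

```python
import difflib
from typing import Iterable, List


def suggest_encodings(name: str, known_names: Iterable[str], limit: int = 3) -> List[str]:
    """Return known encoding names that closely resemble `name`."""
    return difflib.get_close_matches(name, list(known_names), n=limit, cutoff=0.6)


def check_encoding_name(name: str, known_names: Iterable[str]) -> None:
    """Raise ValueError with spelling hints if `name` is not a known encoding."""
    known = list(known_names)
    if name in known:
        return
    hints = suggest_encodings(name, known)
    hint_text = f" Did you mean: {', '.join(hints)}?" if hints else ""
    raise ValueError(f"Unknown encoding {name!r}.{hint_text}")


# Example usage (requires tiktoken to be installed):
#   import tiktoken
#   check_encoding_name("cl100k_base_wrong", tiktoken.list_encoding_names())
```

Passing the known names in keeps the check independent of any particular tiktoken version; the 0.6 similarity cutoff is simply the difflib default and can be tuned.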
Code: broken vs fixed
from langchain.text_splitter import TokenTextSplitter
# This will raise ValueError if encoding is invalid
splitter = TokenTextSplitter(encoding_name='cl100k_base_wrong') # triggers error
chunks = splitter.split_text("Some long text to split.")
from langchain.text_splitter import TokenTextSplitter
# Fixed: corrected encoding name
splitter = TokenTextSplitter(encoding_name='cl100k_base') # corrected encoding
chunks = splitter.split_text("Some long text to split.")
print(chunks)
Workaround
Wrap TokenTextSplitter initialization in try/except ValueError, and fallback to a simpler text splitter like CharacterTextSplitter if encoding is not found.
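The workaround above can be sketched as follows. The function splitter_with_fallback and its two helper factories are hypothetical names used for illustration, and the injectable primary and fallback parameters are an assumption that makes the logic testable without langchain. Keep in mind that CharacterTextSplitter sizes chunks in characters, not tokens, so chunk boundaries will differ after a fallback.

```python
from typing import Any, Callable


def _token_splitter(encoding_name: str) -> Any:
    # Lazy import: only needed when the default factory is actually used.
    from langchain.text_splitter import TokenTextSplitter

    return TokenTextSplitter(encoding_name=encoding_name)


def _character_splitter() -> Any:
    from langchain.text_splitter import CharacterTextSplitter

    return CharacterTextSplitter()


def splitter_with_fallback(
    encoding_name: str,
    primary: Callable[[str], Any] = _token_splitter,
    fallback: Callable[[], Any] = _character_splitter,
) -> Any:
    """Try the token-based splitter; fall back if the encoding is rejected."""
    try:
        return primary(encoding_name)
    except ValueError:
        # The encoding was not found, so use the simpler character-based
        # splitter. Its chunk sizes are measured in characters, not tokens.
        return fallback()
```

Only ValueError is caught here on purpose, so unrelated failures such as a missing package still surface to the caller.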
Prevention
Always verify encoding names against the tiktoken documentation and keep the tiktoken package updated to avoid missing encodings in chunking workflows.