TokenizerError
tokenizers.TokenizerError
Stack trace
tokenizers.TokenizerError: sequence too long
at tokenizers.Tokenizer.encode(...)
File "example.py", line 12, in <module>
encoded = tokenizer.encode(long_text) # triggers TokenizerError
Why it happens
Hugging Face tokenizers enforce a maximum sequence length determined by the model's architecture. For example, bert-base-uncased accepts at most 512 tokens, special tokens included. When the input encodes to more tokens than that limit and truncation is not enabled, the tokenizer raises TokenizerError rather than produce a sequence the model cannot handle.
Detection
Catch tokenizers.TokenizerError during tokenization and log the input length, so that you can identify which inputs exceed the model's maximum token count before the application crashes.
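A minimal sketch of this logging pattern, assuming only that the tokenizer is available as a callable that maps a string to a list of token IDs. The names encode_logged, toy_encode, and ToyTokenizerError are hypothetical stand-ins so that the example runs without downloading a model; in real code you would pass tokenizer.encode and the exception class that your installed version actually raises.

```python
import logging
from typing import Callable, List, Type

logger = logging.getLogger("tokenization")


class ToyTokenizerError(Exception):
    """Hypothetical stand-in for the library's tokenizer exception."""


def toy_encode(text: str, max_tokens: int = 8) -> List[int]:
    """Hypothetical encoder: one ID per whitespace-separated word."""
    words = text.split()
    if len(words) > max_tokens:
        raise ToyTokenizerError("sequence too long")
    return list(range(len(words)))


def encode_logged(
    encode: Callable[[str], List[int]],
    text: str,
    error_type: Type[Exception] = Exception,
) -> List[int]:
    """Call encode(text); on failure, log the input size and re-raise."""
    try:
        return encode(text)
    except error_type:
        # Character and word counts are cheap proxies for the token count.
        logger.error(
            "tokenization failed: %d characters, %d whitespace-separated words",
            len(text),
            len(text.split()),
        )
        raise
```

Re-raising after logging keeps the failure visible to the caller while leaving a record of how large the offending input was.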
Causes & fixes
Cause: Input text exceeds the tokenizer's maximum token limit and truncation is not enabled.
Fix: Enable truncation by passing truncation=True in the tokenizer call, or truncate the input text manually before tokenization.
Cause: The tokenizer's maximum length is smaller than the input requires.
Fix: Switch to a tokenizer/model with a larger maximum token capacity, or split the input into smaller chunks before tokenizing.
Cause: Long documents are passed as raw text without preprocessing or chunking.
Fix: Preprocess the input by chunking long documents into segments that fit within the tokenizer's maximum length.
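The chunking fix can be sketched as follows. This is an illustration under stated assumptions, not a library API: chunk_ids is a hypothetical helper that works on a list of token IDs already produced without special tokens, and max_len should be set below the model limit to leave room for them (for a BERT-style model, two tokens for [CLS] and [SEP]). The optional overlap repeats a few tokens between neighbouring chunks so that context is not cut off abruptly at a boundary.

```python
from typing import List


def chunk_ids(ids: List[int], max_len: int, overlap: int = 0) -> List[List[int]]:
    """Split token IDs into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper,
    not part of any tokenizer library.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        # Stop once a window has reached the end of the sequence,
        # so no trailing chunk consists only of overlap tokens.
        if start + max_len >= len(ids):
            break
    return chunks
```

For instance, chunk_ids(ids, max_len=510) would yield windows that still fit a 512-token model after the two special tokens are added to each one.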
Code: broken vs fixed
# Broken: no truncation
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
long_text = 'a ' * 10000 # 10,000 short words; one unbroken 10,000-character word would collapse to a single [UNK] token
encoded = tokenizer.encode(long_text) # triggers TokenizerError
print(encoded)
# Fixed: truncation enabled
import os
os.environ['HF_HOME'] = '/tmp/hf_cache' # example env usage; set before importing transformers so it takes effect
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
long_text = 'a ' * 10000
encoded = tokenizer.encode(long_text, truncation=True) # fixed: enable truncation
print(encoded)
Workaround
Wrap tokenization in try/except TokenizerError. When the exception is raised, truncate the input string to a safe length and retry tokenization.
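A rough sketch of this retry pattern, with the encoder and exception class passed in as parameters. Everything named here is hypothetical: toy_encode and ToyTokenizerError only simulate a length-limited tokenizer so the example is runnable, and safe_chars is a value you would have to choose yourself. Character count does not map directly to token count, so a character cutoff is a crude guard and should be set conservatively.

```python
from typing import Callable, List, Type


class ToyTokenizerError(Exception):
    """Hypothetical stand-in for the library's tokenizer exception."""


def toy_encode(text: str, max_tokens: int = 16) -> List[int]:
    """Hypothetical encoder: one ID per character, at most max_tokens."""
    if len(text) > max_tokens:
        raise ToyTokenizerError("sequence too long")
    return [ord(ch) for ch in text]


def encode_with_retry(
    encode: Callable[[str], List[int]],
    text: str,
    safe_chars: int,
    error_type: Type[Exception] = Exception,
) -> List[int]:
    """Try the full text first; on failure, retry once on a truncated prefix."""
    try:
        return encode(text)
    except error_type:
        # Keep only the first safe_chars characters and retry.
        # If this still fails, the exception propagates to the caller.
        return encode(text[:safe_chars])
```

With a real tokenizer the call would look like encode_with_retry(tokenizer.encode, text, safe_chars=..., error_type=...), where both arguments depend on your model and library version.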
Prevention
Always pass truncation=True in tokenizer calls, or chunk inputs during preprocessing, so that sequences never exceed the model's maximum token length.