ValueError
transformers.tokenization_utils_base.TruncationOverflowError
Stack trace
ValueError: Token indices sequence length is longer than the specified maximum sequence length for this model (1025 > 1024). Running this sequence through the model will result in indexing errors
Why it happens
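Every Hugging Face tokenizer carries a maximum sequence length, exposed as tokenizer.model_max_length, that reflects how many positions the model can accept. When text is encoded without truncation and produces more tokens than this limit (1025 against 1024 in the message above), the tokenizer reports the problem. Depending on the transformers version and call path, the message may be logged as a warning rather than raised as a ValueError, in which case the failure surfaces later as an indexing error when the over-length ids reach the model.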
Detection
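Compare the token count, not the character count, against tokenizer.model_max_length before sending inputs to the model: encode the text and check the length of input_ids. Catching ValueError around the tokenizer call also works in setups where the error is raised, but it misses cases where the library only logs a warning.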
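The length check described above can be sketched as follows. This is a minimal illustration, not the library's own mechanism: `count_tokens` and `exceeds_limit` are hypothetical helper names, and `WhitespaceTokenizer` is a stand-in that only mimics the two things the check relies on (a callable returning `input_ids`, and a `model_max_length` attribute) so the snippet runs without downloading a model. A real tokenizer loaded with `AutoTokenizer.from_pretrained` should work with the same helpers.

```python
class WhitespaceTokenizer:
    """Stand-in for a Hugging Face tokenizer (assumption, for illustration only)."""

    def __init__(self, model_max_length):
        self.model_max_length = model_max_length

    def __call__(self, text, **kwargs):
        # One fake id per whitespace-separated word; real tokenizers emit subword ids.
        return {"input_ids": list(range(len(text.split())))}


def count_tokens(tokenizer, text):
    """Return the number of token ids the tokenizer produces for text."""
    return len(tokenizer(text)["input_ids"])


def exceeds_limit(tokenizer, text):
    """Return True if text encodes to more tokens than tokenizer.model_max_length."""
    return count_tokens(tokenizer, text) > tokenizer.model_max_length


tokenizer = WhitespaceTokenizer(model_max_length=8)
short_text = "a short input"
long_text = "word " * 20

print(exceeds_limit(tokenizer, short_text))  # False: 3 tokens, limit 8
print(exceeds_limit(tokenizer, long_text))   # True: 20 tokens, limit 8
```

Counting tokens rather than characters matters because the ratio between the two varies with the text and the vocabulary.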
Causes & fixes
Cause: The input encodes to more tokens than the model's maximum, and truncation is not enabled.
Fix: Pass truncation=True in the tokenizer call, or shorten the input text yourself before tokenization.
Cause: The max_length used by the tokenizer (explicit or default) does not match the model's maximum input size.
Fix: Pass max_length explicitly together with truncation=True, set to the model's maximum supported length, for example max_length=tokenizer.model_max_length.
Cause: Very long documents or concatenated texts are passed without chunking or splitting.
Fix: Split long texts into smaller segments that each fit within the tokenizer's maximum length before embedding.
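The chunking fix can be sketched as a sliding window over token ids. This is one possible approach under stated assumptions, not a prescribed method: `chunk_token_ids` is a hypothetical helper, and the ids are assumed to have been produced without special tokens so that each window can have them added afterwards.

```python
def chunk_token_ids(token_ids, max_len, overlap=0):
    """Split a list of token ids into windows of at most max_len ids.

    Consecutive windows share `overlap` ids so context is not cut abruptly.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return chunks


ids = list(range(10))  # pretend these are token ids
chunks = chunk_token_ids(ids, max_len=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With a real tokenizer, max_len would typically be tokenizer.model_max_length minus the number of special tokens the model adds per sequence. The tokenizer call also accepts return_overflowing_tokens=True with a stride argument, which may be a more convenient route to similar overlapping windows.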
Code: broken vs fixed
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Repeated words, so the encoding really is longer than the model limit
text = "hello " * 2000
# Broken: no truncation, so input_ids is longer than tokenizer.model_max_length
tokens = tokenizer(text)['input_ids']

import os
os.environ['HF_HOME'] = '/tmp/hf_cache'  # Example cache location; set before importing transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "hello " * 2000
# Fix: enable truncation so the encoding is capped at the model limit
tokens = tokenizer(text, truncation=True, max_length=tokenizer.model_max_length)['input_ids']
print(f'Tokenized length: {len(tokens)}')

Workaround
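If the error still occurs, catch the ValueError and retry with the input cut down to the tokenizer's maximum length, for example by re-encoding with truncation=True and max_length=tokenizer.model_max_length. Truncation discards everything past the limit, so prefer chunking when the tail of the text matters.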
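A retry of this kind might look like the sketch below. It is illustrative only: `encode_with_fallback` is a hypothetical helper, and `StubTokenizer` is a stand-in that mimics the `truncation` and `max_length` arguments so the snippet runs without a model download. The helper covers both situations, a tokenizer that raises ValueError and one that merely returns an over-length sequence.

```python
class StubTokenizer:
    """Stand-in for a Hugging Face tokenizer (assumption, for illustration only)."""

    model_max_length = 8

    def __call__(self, text, truncation=False, max_length=None, **kwargs):
        ids = list(range(len(text.split())))  # one fake id per word
        if truncation and max_length is not None:
            ids = ids[:max_length]
        return {"input_ids": ids}


def encode_with_fallback(tokenizer, text):
    """Encode text, retrying with truncation if the result is too long."""
    limit = tokenizer.model_max_length
    try:
        ids = tokenizer(text)["input_ids"]
    except ValueError:
        ids = None  # treat a raised error like an over-length result
    if ids is None or len(ids) > limit:
        ids = tokenizer(text, truncation=True, max_length=limit)["input_ids"]
    return ids


ids = encode_with_fallback(StubTokenizer(), "word " * 20)
print(len(ids))  # 8
```

Catching ValueError this broadly can hide unrelated errors, so in real code it may be worth checking the exception message before retrying.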
Prevention
Always use tokenizer truncation or chunk long texts before tokenization to ensure inputs never exceed the model's maximum token length.