ValueError
sentence_transformers.SentenceTransformer.tokenize.ValueError
Stack trace
ValueError: Token indices sequence length is longer than the specified maximum sequence length for this model (512 > 512). Please reduce the sequence length.
Why it happens
Sentence-transformers models have a fixed maximum sequence length (commonly 512 tokens for BERT-based models, though some models use a lower limit; the value is exposed as model.max_seq_length). When input text exceeds this limit, the tokenizer raises a ValueError to prevent model input overflow. This often occurs when long documents or concatenated texts are passed without truncation.
Detection
Measure each input's token count before encoding, and catch ValueError exceptions from the tokenizer so that offending inputs are logged instead of crashing the pipeline.
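A minimal sketch of the length check described above. The names find_overlength and whitespace_count are hypothetical, and the whitespace counter is only a stand-in so the example runs without downloading a model. With a loaded model, a counter such as lambda t: len(model.tokenizer(t)["input_ids"]) could be passed instead, assuming model is a SentenceTransformer instance.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger("embedding_preprocess")


def find_overlength(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every text longer than max_tokens."""
    offenders = []
    for i, text in enumerate(texts):
        n = count_tokens(text)
        if n > max_tokens:
            # Log the offending input so it can be inspected later
            logger.warning("Input %d has %d tokens (limit %d)", i, n, max_tokens)
            offenders.append((i, n))
    return offenders


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


offenders = find_overlength(["a b c", "a b c d e f"], whitespace_count, max_tokens=4)
```

A real subword tokenizer usually produces more tokens than a whitespace split, so the stand-in undercounts and should not be used for production checks.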
Causes & fixes
Cause: Input text length exceeds the model's maximum token length (e.g., >512 tokens).
Fix: Truncate or chunk input texts to fit within the model's max token length before tokenization.
Cause: Concatenating multiple documents or passages without splitting causes token length overflow.
Fix: Split long documents into smaller passages or use a sliding window approach to keep token length under the limit.
Cause: Using a model with a smaller max token length than expected without adjusting inputs.
Fix: Verify the model's max token length and adjust preprocessing accordingly or switch to a model supporting longer sequences.
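The sliding window approach mentioned above can be sketched over a plain list of token ids. The function name sliding_windows and its parameters are hypothetical; in practice the ids would come from the model's tokenizer, and each window would be decoded back to text (or otherwise passed to the model) before encoding.

```python
from typing import List, Sequence


def sliding_windows(
    token_ids: Sequence[int], max_tokens: int, stride: int
) -> List[List[int]]:
    """Split token_ids into windows of at most max_tokens ids.

    Consecutive windows start `stride` ids apart, so they overlap by
    max_tokens - stride ids when stride < max_tokens.
    """
    if max_tokens <= 0 or not (0 < stride <= max_tokens):
        raise ValueError("require max_tokens > 0 and 0 < stride <= max_tokens")
    if len(token_ids) == 0:
        return []
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_tokens]))
        # Stop once the current window reaches the end of the sequence
        if start + max_tokens >= len(token_ids):
            break
        start += stride
    return windows


windows = sliding_windows(list(range(10)), max_tokens=4, stride=2)
```

When the windows are built from real tokenizer output, it is probably worth setting max_tokens slightly below the model limit to leave room for special tokens that the tokenizer adds around each sequence.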
Code: broken vs fixed
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
long_text = """Very long text exceeding 512 tokens..."""
embeddings = model.encode(long_text)  # This line raises ValueError

# Fixed version
import os
from sentence_transformers import SentenceTransformer
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/tmp/st_cache'  # Optional: example cache location, set before loading the model
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
tokenizer = model.tokenizer  # Use the model's own tokenizer so token counts match
long_text = """Very long text exceeding 512 tokens..."""
# Truncate input to the model's max sequence length
max_length = model.max_seq_length
inputs = tokenizer(long_text, truncation=True, max_length=max_length)
# Decode the truncated ids back to text, dropping special tokens such as [CLS] and [SEP]
truncated_text = tokenizer.decode(inputs['input_ids'], skip_special_tokens=True)
embeddings = model.encode(truncated_text)  # Fixed: input truncated to max token length
print('Embedding shape:', embeddings.shape)
Workaround
Catch the ValueError exception during encoding, then manually truncate the input text to the model's max token length using the tokenizer before retrying the encode call.
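A hedged sketch of that catch-and-retry pattern. The helper encode_with_retry and the fake_encode and fake_truncate stand-ins are hypothetical and exist only so the example runs without a model; with sentence-transformers, encode would be model.encode and truncate would be a function that shortens the text using the model's tokenizer.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def encode_with_retry(
    text: str,
    encode: Callable[[str], T],
    truncate: Callable[[str], str],
) -> T:
    """Try to encode text; on ValueError, truncate once and retry."""
    try:
        return encode(text)
    except ValueError:
        # Retry a single time with shortened input; a second failure propagates
        return encode(truncate(text))


MAX_WORDS = 5  # Stand-in limit for the demonstration


def fake_encode(text: str) -> int:
    """Stand-in encoder (assumption): rejects inputs longer than MAX_WORDS words."""
    words = text.split()
    if len(words) > MAX_WORDS:
        raise ValueError("sequence too long")
    return len(words)


def fake_truncate(text: str) -> str:
    """Stand-in truncation (assumption): keep the first MAX_WORDS words."""
    return " ".join(text.split()[:MAX_WORDS])


short_result = encode_with_retry("a b c", fake_encode, fake_truncate)
long_result = encode_with_retry("a b c d e f g h", fake_encode, fake_truncate)
```

Retrying only once keeps a genuinely malformed input from looping forever; any error raised on the second attempt reaches the caller unchanged.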
Prevention
Implement input length checks and automatic truncation or chunking in your preprocessing pipeline to guarantee inputs never exceed the model's max token length.