High severity · Intermediate · Fix: 5-10 min

ValueError

sentence_transformers.SentenceTransformer.tokenize.ValueError

What this error means

The sentence-transformers tokenizer received input text exceeding its maximum token length, causing a ValueError during tokenization.

Stack trace

traceback

ValueError: Token indices sequence length is longer than the specified maximum sequence length for this model (512 > 512). Please reduce the sequence length.

QUICK FIX

Truncate each input text to the model's maximum sequence length (512 tokens in the trace above) or fewer before passing it to the sentence-transformers tokenizer.

Why it happens

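Each sentence-transformers model has a fixed maximum sequence length (commonly 512 tokens, sometimes lower: all-MiniLM-L6-v2, for example, is configured with a max_seq_length of 256). When tokenized input exceeds this limit, the tokenizer raises a ValueError rather than pass an oversized sequence to the model. This typically happens when long documents or concatenated texts are passed without truncation or chunking.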

Detection

Count the tokens in each input before encoding and log any input that exceeds the model's limit. Also catch ValueError exceptions from the tokenizer so that an offending input is recorded instead of crashing the whole batch.
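A pre-flight check along these lines is one way to do the counting and logging. This is a minimal sketch, assuming a Hugging Face style tokenizer (calling it on a string returns a mapping with an "input_ids" entry), such as the one a loaded model exposes as model.tokenizer. The helper name find_overlong_inputs and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("embedding_preflight")  # hypothetical logger name


def find_overlong_inputs(texts, tokenizer, max_tokens):
    """Return (index, token_count) for every text longer than max_tokens.

    `tokenizer` is assumed to follow the Hugging Face calling convention:
    tokenizer(text) returns a mapping whose "input_ids" entry already
    includes special tokens such as [CLS] and [SEP].
    """
    overlong = []
    for index, text in enumerate(texts):
        token_count = len(tokenizer(text)["input_ids"])
        if token_count > max_tokens:
            logger.warning(
                "input %d has %d tokens (limit %d)", index, token_count, max_tokens
            )
            overlong.append((index, token_count))
    return overlong
```

With a loaded model, a call such as find_overlong_inputs(texts, model.tokenizer, model.max_seq_length) would be the natural usage, assuming both attributes are available in your installed version.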

Causes & fixes

Input text length exceeds the model's maximum token length (e.g., >512 tokens).

✓ Fix

Truncate or chunk input texts to fit within the model's max token length before tokenization.

Concatenating multiple documents or passages without splitting causes token length overflow.

✓ Fix

Split long documents into smaller passages or use a sliding window approach to keep token length under the limit.
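The sliding window approach can be sketched on a plain list of token ids. The function name and parameters below are illustrative rather than part of any library. For BERT-style models, the window would usually be set a couple of tokens below the model's limit to leave room for special tokens, which is an assumption to check against your model.

```python
def sliding_window_chunks(token_ids, window, stride):
    """Split token_ids into overlapping chunks of at most `window` tokens.

    Consecutive chunks start `stride` tokens apart, so they overlap by
    window - stride tokens. The last chunk may be shorter than `window`.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("require window > 0 and 0 < stride <= window")
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this chunk already reaches the end of the sequence
        start += stride
    return chunks
```

Each chunk can then be decoded back to text and encoded separately. How the per-chunk embeddings are combined afterwards (averaging, keeping them as separate passages, and so on) depends on the application.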

Using a model with a smaller max token length than expected without adjusting inputs.

✓ Fix

Verify the model's max token length and adjust preprocessing accordingly or switch to a model supporting longer sequences.
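To check the limit programmatically, a helper like the following may be useful. It is a sketch that assumes the model object exposes max_seq_length and tokenizer attributes, as SentenceTransformer objects do. The helper name, the fallback of 512, and the cutoff of 1,000,000 are arbitrary illustrative choices.

```python
def effective_token_limit(model, fallback=512):
    """Best-effort guess at the longest input the model accepts, in tokens.

    Assumes `model` exposes `max_seq_length` and `tokenizer` the way
    SentenceTransformer objects do. `fallback` is an arbitrary default
    used only when neither attribute gives a usable number.
    """
    candidates = []
    max_seq_length = getattr(model, "max_seq_length", None)
    if isinstance(max_seq_length, int) and max_seq_length > 0:
        candidates.append(max_seq_length)
    tokenizer = getattr(model, "tokenizer", None)
    tokenizer_limit = getattr(tokenizer, "model_max_length", None)
    # Hugging Face tokenizers report a very large sentinel value when no
    # limit is configured, so ignore anything implausibly large.
    if isinstance(tokenizer_limit, int) and 0 < tokenizer_limit < 1_000_000:
        candidates.append(tokenizer_limit)
    return min(candidates) if candidates else fallback
```

Taking the minimum of the two values is a conservative choice: the model-level setting can be lower than what the tokenizer alone reports.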

Code: broken vs fixed

Broken - triggers the error

python

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

long_text = """Very long text exceeding 512 tokens..."""
embeddings = model.encode(long_text)  # This line raises ValueError

Fixed - works correctly

python

from sentence_transformers import SentenceTransformer

model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
# Reuse the tokenizer loaded with the model. AutoTokenizer.from_pretrained would
# need the full Hub id 'sentence-transformers/all-MiniLM-L6-v2' to resolve.
tokenizer = model.tokenizer

long_text = """Very long text exceeding 512 tokens..."""

# Truncate to the limit the model itself enforces (special tokens included)
max_length = model.max_seq_length
inputs = tokenizer(long_text, truncation=True, max_length=max_length)
# Decoding and re-tokenizing may not reproduce exactly the same token ids
truncated_text = tokenizer.decode(inputs['input_ids'], skip_special_tokens=True)

embeddings = model.encode(truncated_text)  # Fixed: input truncated to max token length
print('Embedding shape:', embeddings.shape)

Added explicit tokenization with truncation to ensure input text does not exceed the model's max token length, preventing ValueError during encoding.

⚠ Workaround

Catch the ValueError exception during encoding, then manually truncate the input text to the model's max token length using the tokenizer before retrying the encode call.
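The catch-and-retry pattern can be sketched generically. Both callables below are supplied by the caller and their names are hypothetical: encode could be model.encode, and truncate could be a function that cuts text down to the model's token limit using its tokenizer.

```python
def encode_with_truncation_retry(encode, truncate, text):
    """Call encode(text); on ValueError, truncate the text and retry once.

    `encode` and `truncate` are caller-supplied callables (hypothetical
    names): for example model.encode, and a function that cuts text down
    to the model's token limit. A second ValueError propagates unchanged.
    """
    try:
        return encode(text)
    except ValueError:
        return encode(truncate(text))
```

Retrying only once keeps a genuine failure visible: if the truncated text still fails, the exception reaches the caller instead of looping.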

✓ Prevention

Implement input length checks and automatic truncation or chunking in your preprocessing pipeline to guarantee inputs never exceed the model's max token length.

Python 3.9+ · sentence-transformers >=2.0.0 · tested on 2.2.2

Verified 2026-04
