ValueError
transformers.tokenization_utils_base.ValueError
Stack trace
ValueError: This tokenizer does not have a pad token. Please set a pad token using tokenizer.pad_token = '<PAD>' or specify pad_token_id.
Why it happens
HuggingFace tokenizers require a pad token to pad input sequences to the same length. If the tokenizer model does not have a pad token set by default, attempts to pad sequences will raise this error. This often happens with models that do not include a pad token in their vocabulary or when the pad token is not explicitly assigned.
Detection
Check if tokenizer.pad_token is None before padding inputs, or catch ValueError during tokenization and log the tokenizer configuration to detect missing pad tokens early.
Causes & fixes
The tokenizer model does not have a default pad token set.
Manually assign a pad token string to tokenizer.pad_token, e.g., tokenizer.pad_token = '[PAD]'.
Using a tokenizer for a model that inherently lacks a pad token (e.g., GPT-2).
Set the pad token to the EOS token or another unused token: tokenizer.pad_token = tokenizer.eos_token.
Loading a tokenizer without specifying padding parameters or pad token id.
Specify pad_token_id explicitly when calling tokenizer.pad or tokenizer.__call__ with padding.
Code: broken vs fixed
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
inputs = tokenizer(['Hello', 'Hello world'], padding=True) # This line raises ValueError import os
from transformers import AutoTokenizer
# No hardcoded keys needed here
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token # Set pad token to eos token to fix error
inputs = tokenizer(['Hello', 'Hello world'], padding=True)
print(inputs) Workaround
Catch the ValueError during tokenization and manually pad sequences by adding a pad token string to inputs before tokenization or use tokenizer.pad_token_id with a custom padding implementation.
Prevention
Always verify that the tokenizer has a pad token set before using padding. For models without a default pad token, explicitly assign one during tokenizer initialization or configuration.