Padding and truncation: max_length
Why this matters
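Transformer models process batches as rectangular tensors, so every sequence in a batch must have the same length, and each model accepts only a limited maximum sequence length. Real-world text varies wildly in length: if you don't standardize it, batching fails, over-long inputs raise errors, or the model produces poor results. This is a prerequisite skill before running any inference or fine-tuning.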
Explanation
Padding and truncation are tokenizer operations that force variable-length text into a fixed sequence length. When you set max_length=512, the tokenizer adds padding tokens (typically [PAD]) to sequences shorter than 512 tokens and removes tokens from the end of sequences longer than 512 tokens.
Mechanically: the tokenizer converts text → tokens, then applies truncation (keeps the first N tokens, counting special tokens such as [CLS] and [SEP], and discards the rest) or padding (appends [PAD] tokens until the sequence reaches N). You control this with the padding='max_length', truncation=True, and max_length=N parameters.
When to use: Set truncation=True with an explicit max_length during inference so that no input exceeds the length your model supports. Use padding='max_length' when every batch must have the same fixed shape; use padding='longest' to pad only to the longest sequence in the current batch, which saves compute when most inputs are much shorter than max_length. A single example processed on its own needs no padding at all.
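To make the mechanics concrete, here is a minimal pure-Python sketch of what padding and truncation do to one sequence of token IDs. It is not the library's implementation: pad_or_truncate is a hypothetical helper written for illustration, it only truncates and pads on the right, and the default IDs for [CLS], [SEP], and [PAD] (101, 102, 0) are the bert-base-uncased values.

```python
def pad_or_truncate(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration; assumes max_length >= 2.
    """
    # Reserve two positions for [CLS] and [SEP], then cut from the right.
    body = token_ids[: max_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)

    # Pad on the right; padded positions get attention_mask 0.
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


# 'hello world' as two token IDs, and a made-up 30-token sequence.
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1030)), max_length=8)

print(short_ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
print(long_ids)    # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
print(long_mask)   # [1, 1, 1, 1, 1, 1, 1, 1]
```

Note that the truncated sequence still ends with [SEP]: the special tokens count toward max_length, so only max_length - 2 positions are left for the text itself.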
Analogy
Think of <code>max_length</code> like a standardized form with a fixed number of lines. If your application is short, you fill empty lines with nothing (padding). If your application is too long, you cut it off mid-sentence (truncation). Your form processor doesn't care which case applies: it just expects a form of standard size.
Code
from transformers import AutoTokenizer

# Load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text_short = 'Hello world'
text_long = 'The quick brown fox jumps over the lazy dog ' * 30

# Short input: padded up to max_length
result_short = tokenizer(
    text_short,
    max_length=20,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

# Long input: truncated down to max_length
result_long = tokenizer(
    text_long,
    max_length=20,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

# Both results have the same sequence length (dimension 1 of the tensor)
print('Short text token count:', result_short['input_ids'].shape[1])
print('Short text:', result_short['input_ids'])
print('\nLong text token count:', result_long['input_ids'].shape[1])
print('Long text (first 20 tokens):', result_long['input_ids'])
print('\nAttention mask (short):', result_short['attention_mask'])
print('Attention mask (long):', result_long['attention_mask'])
Output:
Short text token count: 20
Short text: tensor([[ 101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]])
Long text token count: 20
Long text (first 20 tokens): tensor([[ 101, 1996, 3613, 2882, 4419, 8814, 2058, 1996, 13971, 3899, 1996,
3613, 2882, 4419, 8814, 2058, 1996, 13971, 3899, 102]])
Attention mask (short): tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Attention mask (long): tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
What just happened?
The tokenizer converted two strings (one short, one long) into fixed-length sequences of 20 tokens each. The short text was padded with token ID 0 ([PAD]) after the sentence-end token [SEP] (102); in bert-base-uncased, ID 103 is [MASK], not [PAD]. The long text was truncated to exactly 20 tokens: [CLS] (101), the first two repetitions of the sentence (18 tokens), and [SEP]; everything after that was discarded. The <code>attention_mask</code> shows which positions are real tokens (1) versus padding (0), telling the model to ignore padded positions during computation.
Common gotcha
Developers often forget to pass attention_mask in the model call. If you set padding='max_length' but don't pass attention_mask to your model, the model treats [PAD] tokens as real input, which corrupts your results. Always pass both input_ids and attention_mask from the tokenizer output to the model.
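A toy illustration of why the mask matters, assuming a single simplified attention step over made-up scores. masked_softmax is a hypothetical helper, not a library function; real models apply the mask inside their attention layers, but the effect is the same: without a mask, padded positions receive a share of the attention weight.

```python
import math


def masked_softmax(scores, attention_mask=None):
    """Softmax over scores; positions with mask 0 receive zero weight."""
    if attention_mask is None:
        attention_mask = [1] * len(scores)
    # Exclude masked positions by treating their score as -infinity.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Four real tokens followed by four [PAD] positions (scores are made up).
scores = [2.0, 1.0, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0]
mask = [1, 1, 1, 1, 0, 0, 0, 0]

without_mask = masked_softmax(scores)
with_mask = masked_softmax(scores, mask)

print("weight on padding, no mask:  ", round(sum(without_mask[4:]), 3))
print("weight on padding, with mask:", round(sum(with_mask[4:]), 3))
```

With these example scores, roughly 40% of the attention weight lands on padding when the mask is omitted, and none when it is supplied.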
Error recovery
RuntimeError: Expected input batch_size (1) to match the input provided to the model (8)
Token indices sequence length is longer than the maximum (2048 > 512)
ValueError: Padding must be truncated to a multiple of 8
Experienced dev note
In transformers 5.5.x, the tokenizer returns attention_mask automatically; you don't have to opt in. But many developers still loop through datasets without batching, tokenizing one example at a time. This defeats the purpose of padding, which exists so that sequences can be stacked into one batch, and it leaves your GPU underused. Always tokenize in batches: tokenizer([text1, text2, ...], padding='max_length', ...). Also: never set max_length larger than the maximum sequence length your model supports (e.g., BERT with max_length=2048 will fail because it was trained with a limit of 512). Check your model card.
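The cost of fixed-length padding in a batch can be sketched in plain Python. This is an illustration under stated assumptions, not the tokenizer's code: pad_batch is a hypothetical helper, the batch contents are made up, and model_limit = 512 stands in for whatever limit your model card states.

```python
def pad_batch(batch, padding, max_length, pad_id=0):
    """Pad a batch of token-ID lists; truncate anything over max_length."""
    truncated = [ids[:max_length] for ids in batch]
    if padding == "longest":
        target = max(len(ids) for ids in truncated)
    elif padding == "max_length":
        target = max_length
    else:
        raise ValueError(f"unsupported padding mode: {padding!r}")
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Hypothetical batch: three already-tokenized sequences of 5, 12, and 30 tokens.
batch = [[1] * 5, [1] * 12, [1] * 30]
model_limit = 512  # assumed limit; take the real value from your model card

for mode in ("max_length", "longest"):
    ids, mask = pad_batch(batch, padding=mode, max_length=model_limit)
    n_pad = sum(row.count(0) for row in mask)
    print(f"{mode}: sequence length {len(ids[0])}, padded positions {n_pad}")
# max_length: sequence length 512, padded positions 1489
# longest: sequence length 30, padded positions 43
```

Both modes produce a rectangular batch with a matching attention mask; padding to the longest sequence simply spends far fewer positions on [PAD] when the inputs are short.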
Check your understanding
You tokenize a batch of 8 texts with max_length=128, padding='max_length', truncation=True. Three of them have only 50 tokens after tokenization. What does the attention_mask look like for those three, and why does the model need it?
Show answer hint
A correct answer explains that the attention_mask for each short text has 1s in the first 50 positions and 0s in the remaining 78 (indices 50-127, the padded region), and that the model uses this to ignore [PAD] tokens during attention computation so they don't influence the learned representation.