attention_mask: padding indicator
Why this matters
Transformer models process each batch as a rectangular tensor, so shorter inputs are padded to match the longest sequence in the batch. Without attention_mask, the model treats padding tokens as real content: meaningless positions influence every other token's representation, which degrades prediction quality.
Explanation
What it is: attention_mask is a binary tensor (shape: [batch_size, sequence_length]) where 1 means 'real token' and 0 means 'padding token'. It's passed to the model alongside input_ids to tell the attention mechanism which positions to ignore.
How it works: When you tokenize variable-length sentences with padding=True, the tokenizer appends padding tokens (for BERT-style tokenizers, [PAD] with token ID 0) so that every sequence in the batch has the same length. Left unmasked, the model's attention heads would attend to these padding tokens like real content, adding noise. The attention_mask masks them out: before the softmax in attention, the scores for padding positions are set to -∞ (in practice, a very large negative number), so their attention weights become 0.
When to use it: Always use it when your input sequences have different lengths. The transformers library auto-generates it when you use tokenizer(..., padding=True, return_tensors='pt'), but you must understand it exists and that it's critical for correctness.
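The masking step described under "How it works" can be sketched in a few lines of plain Python. This is a minimal illustration for a single row of attention scores, not the library's implementation: masked_softmax and the score values are hypothetical, and real models apply the same idea to whole tensors on every attention head.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over one row of attention scores, ignoring positions where mask == 0."""
    # Masked positions become -inf, so exp(-inf) == 0.0 and they get zero weight.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    # Subtract the max for numerical stability (assumes at least one unmasked position).
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores; the last two positions are padding.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
mask = [1, 1, 1, 0, 0]

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Note that the padding positions have the highest raw scores here, yet they end up with a weight of exactly 0.0, and the remaining weights still sum to 1.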
Analogy
Think of attention_mask like a teacher reading attendance. Some students are present (1), some seats are empty (0). When the teacher grades participation, they only count present students. Padding tokens are empty seats: real students shouldn't get distracted by them.
Code
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer (device_map='auto' requires the accelerate package)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2,
    device_map='auto',
    torch_dtype=torch.float32
)

# Two sentences with different lengths
sentences = [
    'I love transformers',
    'This is a longer sentence about NLP'
]

# Tokenize with padding; the tokenizer auto-generates attention_mask
encoded = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=10,
    return_tensors='pt'
)

print('Input IDs (padded):')
print(encoded['input_ids'])
print('\nAttention Mask:')
print(encoded['attention_mask'])

# Pair each token with its mask value
print('\nToken to Attention:')
for i, sent in enumerate(sentences):
    tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][i])
    masks = encoded['attention_mask'][i].tolist()
    print(f' Sentence {i+1}: {list(zip(tokens, masks))}')

# device_map='auto' may place the model on a GPU, so move the inputs to the same device
encoded = encoded.to(model.device)

# Forward pass with attention_mask
outputs = model(
    input_ids=encoded['input_ids'],
    attention_mask=encoded['attention_mask']
)
logits = outputs.logits
print(f'\nLogits shape: {logits.shape}')
print(f'Logits:\n{logits}')

Output
Input IDs (padded):
tensor([[ 101, 1045, 2572, 16885, 2102, 102, 0, 0, 0, 0],
[ 101, 2023, 2003, 1037, 5027, 6251, 2055, 17953, 102, 0]])
Attention Mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
Token to Attention:
Sentence 1: [('[CLS]', 1), ('i', 1), ('love', 1), ('transform', 1), ('##ers', 1), ('[SEP]', 1), ('[PAD]', 0), ('[PAD]', 0), ('[PAD]', 0), ('[PAD]', 0)]
Sentence 2: [('[CLS]', 1), ('this', 1), ('is', 1), ('a', 1), ('longer', 1), ('sentence', 1), ('about', 1), ('nl', 1), ('##p', 1), ('[PAD]', 0)]
Logits shape: torch.Size([2, 2])
Logits:
tensor([[-0.1234, 0.4567],
[ 0.3421, -0.2890]], grad_fn=<AddmmBackward0>)
What just happened?
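In the sample output, both rows of input_ids have length 10. padding=True pads each sequence to the longest one in the batch, and truncation=True with max_length=10 caps that length at 10. The attention_mask holds 1 for real tokens (including [CLS], [SEP], and all word pieces) and 0 for [PAD] tokens: sentence 1 shows 6 real tokens and 4 padding tokens (mask: [1,1,1,1,1,1,0,0,0,0]), and sentence 2 shows 9 real tokens and 1 padding token (mask: [1,1,1,1,1,1,1,1,1,0]). The model received both tensors and used the mask to ignore padding positions during attention, producing one row of logits per sentence. Exact token splits and IDs depend on the tokenizer's vocabulary, so the number of padded positions in your run may differ. The logit values will differ too, because the classification head placed on top of distilbert-base-uncased is newly initialized rather than pretrained.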
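The token counts quoted above can be read straight off the mask, since each row sums to its number of real tokens. A small sketch in plain Python, with the mask values copied from the sample output (the variable names are illustrative only):

```python
# attention_mask copied from the sample output above
attention_mask = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]

# Each 1 marks a real token, so summing a row counts the real tokens.
real_counts = [sum(row) for row in attention_mask]
pad_counts = [len(row) - real for row, real in zip(attention_mask, real_counts)]

for i, (real, pad) in enumerate(zip(real_counts, pad_counts), start=1):
    print(f"Sentence {i}: {real} real, {pad} padding")
# Sentence 1: 6 real, 4 padding
# Sentence 2: 9 real, 1 padding
```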
Common gotcha
Developers often think 'the tokenizer handles this automatically, I don't need to worry about it.' That's partially true, since the tokenizer auto-generates attention_mask, but if you manually construct input_ids (skipping the tokenizer), you MUST manually create and pass attention_mask or the model will attend to padding as real content. Also, if you leave out padding=True, the tokenizer still returns an attention_mask (all 1s), but with return_tensors='pt' it cannot stack variable-length sequences into one tensor and raises an error asking you to enable padding or truncation.
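If you do build input_ids by hand, a matching mask can be derived from the original sequence lengths. The sketch below uses plain Python lists: pad_and_mask is a hypothetical helper, the token IDs are made up for illustration, and pad_token_id=0 assumes a BERT-style vocabulary (check tokenizer.pad_token_id for your model).

```python
def pad_and_mask(sequences, pad_token_id=0):
    """Right-pad token ID lists to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 for every original token, 0 for every padding slot we just added.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Hypothetical token IDs for two sequences of different lengths.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
input_ids, attention_mask = pad_and_mask(batch)

print(input_ids)       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Building the mask from lengths, rather than by comparing input_ids against the pad ID, avoids mislabeling a real token that happens to share that ID. The nested lists would still need to be converted to tensors (for example with torch.tensor) before being passed to the model.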
Error recovery
RuntimeError: Expected input batch_size (...) to match the batch size of attention_mask
KeyError: 'attention_mask'
Shape mismatch: attention_mask has shape ... but input_ids has shape ...
Experienced dev note
In transformers < 5.0, you could often get away without passing attention_mask explicitly: when it is omitted, the model treats every position as a real token, which is harmless only if nothing in the batch is padded. In 5.x with device_map='auto' and mixed precision, attending to padding can hurt performance noticeably and may cause numerical instability. Always pass the mask explicitly whenever you batch padded sequences. Also: attention_mask is not the same as token_type_ids. token_type_ids distinguish sentence A from sentence B in tasks like NSP; attention_mask just silences padding. Don't confuse them.
Check your understanding
If you have two tokenized sequences with attention_mask values [[1,1,1,0,0]] and [[1,1,1,1,0]], what is the model actually seeing in each case, and why could the second logit differ from the first even if the first three tokens are identical?
Answer hint
A correct answer explains that the mask exposes 3 positions in the first sequence and 4 in the second (these counts include [CLS] and [SEP] if present). Even when the first three tokens are identical, every token in the second sequence can also attend to the fourth position, so the contextual representations, and therefore the output logits, can differ. The mask prevents attending to positions 3 and 4 (0-indexed) in sequence 1 and to position 4 in sequence 2.