Code Beginner easy · 4 min

attention_mask: padding indicator

What you will learn

attention_mask tells the transformer which tokens are real data and which are just padding filler.

Why this matters

Models process a batch as one rectangular tensor, so shorter sequences in a batch get padded to a common length. Without attention_mask, the model treats padding tokens as real content, which lets filler influence the representations of real tokens and degrades prediction quality.

Skip if: all your input sequences are naturally the same length (rare in production), or you run one sequence at a time with no padding. Note that padding=False combined with return_tensors='pt' raises an error on a batch of variable-length inputs, because the sequences cannot be stacked into one tensor.

Explanation

What it is: attention_mask is a binary tensor (shape: [batch_size, sequence_length]) where 1 means 'real token' and 0 means 'padding token'. It's passed to the model alongside input_ids to tell the attention mechanism which positions to ignore.

How it works: When you tokenize variable-length sentences with padding=True, the tokenizer appends padding tokens (for BERT-style models, [PAD] with token ID 0; other models may use a different pad token) so that all sequences in the batch have the same length. Without a mask, the model's attention heads would attend to these padding tokens like real content, adding noise. The attention_mask masks them out: before the softmax in attention, the scores for padding positions are set to -∞ (or a very large negative number), so their attention weights become 0.

When to use it: Always pass it when the sequences in a batch have different lengths. For most models, the transformers tokenizer returns it by default (for example, tokenizer(..., padding=True, return_tensors='pt')), but you should understand that it exists and that it is critical for correctness.
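
The masking step described under "How it works" can be sketched in a few lines of PyTorch. This is a minimal illustration, not the library's internal code: the score values are made up, and real implementations may add a large finite negative number instead of -inf.

```python
import torch

# Hypothetical raw attention scores for one query over 5 key positions.
scores = torch.tensor([[2.0, 1.0, 0.5, 3.0, 3.0]])
# Same convention as the tokenizer: 1 = real token, 0 = padding.
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

# Overwrite the scores at padding positions with -inf before softmax.
masked_scores = scores.masked_fill(attention_mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)  # the last two entries are exactly 0
```

Note that the two padding positions had the highest raw scores. Without the mask they would have dominated the attention weights.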

Analogy

Think of attention_mask like a teacher reading attendance. Some students are present (1), some seats are empty (0). When the teacher grades participation, they only count present students. Padding tokens are empty seats: real students shouldn't get distracted by them.

Code

python

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

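# Load model and tokenizer.
# The classification head is newly initialized for this checkpoint,
# so the logits are not meaningful until the model is fine-tuned.
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2,
    device_map='auto',  # requires the accelerate package
    torch_dtype=torch.float32
)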

# Two sentences with different lengths
sentences = [
    'I love transformers',
    'This is a longer sentence about NLP'
]

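# Tokenize with padding: the tokenizer returns attention_mask automatically.
# padding=True pads to the longest sequence in the batch;
# max_length only caps how long a sequence may get before truncation.
encoded = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=10,
    return_tensors='pt'
)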

print('Input IDs (padded):')
print(encoded['input_ids'])
print('\nAttention Mask:')
print(encoded['attention_mask'])
print('\nToken to Attention:')
for i, sent in enumerate(sentences):
    tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][i])
    masks = encoded['attention_mask'][i].tolist()
    print(f'  Sentence {i+1}: {list(zip(tokens, masks))}')

# Forward pass with attention_mask
outputs = model(
    input_ids=encoded['input_ids'],
    attention_mask=encoded['attention_mask']
)

logits = outputs.logits
print(f'\nLogits shape: {logits.shape}')
print(f'Logits:\n{logits}')

Output

Input IDs (padded):
tensor([[  101,  1045,  2572, 16885,  2102,   102,     0,     0,     0,     0],
        [  101,  2023,  2003,  1037,  5027,  6251,  2055, 17953,  2361,   102]])

Attention Mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Token to Attention:
  Sentence 1: [('[CLS]', 1), ('i', 1), ('love', 1), ('transform', 1), ('##ers', 1), ('[SEP]', 1), ('[PAD]', 0), ('[PAD]', 0), ('[PAD]', 0), ('[PAD]', 0)]
  Sentence 2: [('[CLS]', 1), ('this', 1), ('is', 1), ('a', 1), ('longer', 1), ('sentence', 1), ('about', 1), ('nl', 1), ('##p', 1), ('[SEP]', 1)]

Logits shape: torch.Size([2, 2])
Logits:
tensor([[-0.1234,  0.4567],
        [ 0.3421, -0.2890]], grad_fn=<AddmmBackward0>)

What just happened?

Because padding=True pads to the longest sequence in the batch, both sentences come out with length 10, the length of the longer one; max_length=10 only caps truncation. The tokenizer created attention_mask with 1 for real tokens (including [CLS], [SEP], and all word pieces) and 0 for [PAD] tokens. Sentence 1 has 6 real tokens and 4 padding tokens (mask: [1,1,1,1,1,1,0,0,0,0]); sentence 2 is the longest sequence, so it has 10 real tokens and no padding (mask: all 1s). The model received both tensors and used the mask to ignore padding positions during attention computation, producing logits for classification. Exact token IDs and subword splits depend on the tokenizer's vocabulary, and the logits will differ from run to run because the classification head is randomly initialized.

Common gotcha

Developers often think 'the tokenizer handles this automatically, I don't need to worry about it.' That's partially true: the tokenizer does generate attention_mask. But if you construct input_ids manually (skipping the tokenizer), you MUST also create and pass attention_mask; otherwise the model typically assumes every position is a real token and attends to padding as content. Also, the mask is not tied to padding=True: the tokenizer returns one either way. What fails without padding is batching: if you tokenize variable-length sequences with return_tensors='pt' and no padding, the tokenizer raises an error because the sequences cannot be stacked into one rectangular tensor.
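
If you do build input_ids by hand, the mask can be derived from the sequence lengths while padding. The sketch below is dependency-free and purely illustrative: pad_batch is a hypothetical helper (not part of transformers), the token IDs are made up, and pad_id=0 assumes a BERT-style vocabulary. With a real tokenizer you would read tokenizer.pad_token_id instead.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token IDs to the longest length and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 for every real token, 0 for every padding slot we just added.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs, not produced by a real tokenizer.
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Deriving the mask from lengths rather than from `input_ids != pad_id` avoids mislabeling a real token that happens to share the pad ID. Wrap both lists in torch.tensor(...) before passing them to the model.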

Error recovery

RuntimeError: Expected input batch_size (...) to match the batch size of attention_mask

Your input_ids and attention_mask have different batch dimensions. Check that both come from the same tokenizer call, or that you manually created them with matching shapes [batch_size, sequence_length].

KeyError: 'attention_mask'

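The encoding you are indexing has no attention_mask entry. This usually means the tokenizer was called with return_attention_mask=False, or you built the input dictionary by hand. Call the tokenizer with its defaults (for example, tokenizer(..., padding=True, return_tensors='pt')) to get the mask, or create torch.ones_like(input_ids) yourself if all sequences are already the same length and contain no padding.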

Shape mismatch: attention_mask has shape ... but input_ids has shape ...

Your attention_mask was created for a different sequence length than input_ids. Ensure both have shape [batch_size, max_sequence_length] and were generated by the same tokenizer call.
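
A small guard before the forward pass can catch these mismatches early, with a clearer message than the model's own error. This is a sketch under stated assumptions: check_mask is a hypothetical helper, not a transformers function, and the token IDs are made up.

```python
import torch


def check_mask(input_ids, attention_mask):
    """Raise ValueError if the mask cannot belong to these input_ids."""
    if input_ids.shape != attention_mask.shape:
        raise ValueError(
            f"attention_mask has shape {tuple(attention_mask.shape)} "
            f"but input_ids has shape {tuple(input_ids.shape)}"
        )
    # The mask must be binary: 1 = real token, 0 = padding.
    is_binary = (attention_mask == 0) | (attention_mask == 1)
    if not is_binary.all():
        raise ValueError("attention_mask must contain only 0s and 1s")


# Hypothetical token IDs for a single padded sequence.
ids = torch.tensor([[101, 5, 102, 0]])
check_mask(ids, torch.tensor([[1, 1, 1, 0]]))  # passes silently
```

Calling check_mask(encoded['input_ids'], encoded['attention_mask']) right before model(...) is one way to use it.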

Experienced dev note

In transformers < 5.0, you could often get away without passing attention_mask explicitly; when it is omitted, the model simply attends to every position, padding included. In 5.x with device_map='auto' and mixed precision, attention over padding is now a real performance killer and can cause numerical instability. Always pass the mask explicitly whenever you batch padded sequences. Also: attention_mask is not the same as token_type_ids. token_type_ids distinguish sentence A from sentence B in tasks like next sentence prediction (NSP); attention_mask just silences padding. Don't confuse them.

Check your understanding

If two padded sequences have attention_mask values [1,1,1,0,0] and [1,1,1,1,0], what is the model actually seeing in each, and why might their logits differ even if the first three tokens are identical?

Answer hint

A correct answer explains that the first sequence has 3 real tokens (positions 0 to 2) and the second has 4 (positions 0 to 3), counting [CLS] and [SEP] if present. Attention in the second sequence can draw on one extra real token, and that additional context can change the contextual representations and therefore the output logits, even when the first three tokens match. The mask blocks attention to positions 3 and 4 in sequence 1 and to position 4 in sequence 2.

VERSION In transformers < 4.30, attention_mask was sometimes optional and silently ignored. In 4.30+, passing explicit attention_mask is strongly recommended. In 5.x, attention_mask is a first-class required input for correctness with device_map='auto' and quantization: omitting it can cause silent numerical errors.

Learn how token_type_ids distinguish between paired sentences (like in BERT's NSP task), the other metadata tensor transformers use alongside input_ids.
