DataCollatorWithPadding
Why this matters
When training transformers on variable-length sequences, every batch has to become a rectangular tensor before it reaches the GPU. DataCollatorWithPadding replaces hand-written padding code and plugs into the PyTorch DataLoader as its collate_fn. That avoids shape-mismatch errors and keeps memory use down, because each batch is padded only as far as its own longest sequence.
Explanation
DataCollatorWithPadding is a callable object that takes a list of tokenized samples (one dict per sample) and pads them to the longest sequence in that batch, not to a global maximum length. It does this by delegating to the tokenizer's pad method: input_ids are padded with the tokenizer's pad token ID, attention_mask is extended with 0s over the padded positions, and token_type_ids are padded too when present. The result is returned as PyTorch tensors by default. A label or label_ids key is renamed to labels; any other extra key must already be convertible to a tensor, so drop raw text columns before collating. Dynamic padding is more efficient than padding everything to a fixed global maximum: a batch that happens to contain only short sequences stays short. You typically instantiate the collator with a tokenizer and pass it to PyTorch's DataLoader as the collate_fn argument, so each batch is padded just before it reaches your training loop.
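To make the mechanics concrete, here is a minimal pure-Python sketch of per-batch padding. It is not the library's implementation: `pad_batch` is a hypothetical helper, it assumes right-side padding with pad token ID 0 (as for bert-base-uncased), and it returns plain lists rather than tensors.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sample to the longest sample in this batch (hypothetical helper)."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = longest - len(ids)
        # Pad on the right with the pad token ID.
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # Real tokens get mask 1, padded positions get mask 0.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 7592, 2088, 102]},  # 4 tokens
    {"input_ids": [101, 2460, 102]},        # 3 tokens
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[101, 7592, 2088, 102], [101, 2460, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In real code you would not write this yourself; you would hand the collator to the loader, for example `DataLoader(dataset, batch_size=8, collate_fn=DataCollatorWithPadding(tokenizer=tokenizer))`.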
Analogy
Think of it as a smart delivery service: instead of wrapping every package to the size of your largest box (waste), DataCollatorWithPadding measures the tallest package in each truck and pads only to that height. Different trucks (batches) get different padding depending on what's inside.
Code
from transformers import AutoTokenizer, DataCollatorWithPadding
from datasets import Dataset
import torch  # backend for the PyTorch tensors the collator returns

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

samples = [
    {'text': 'Hello world'},
    {'text': 'This is a longer sentence for testing'},
    {'text': 'Short'},
]
dataset = Dataset.from_dict({
    'text': [s['text'] for s in samples]
})

def tokenize_fn(examples):
    # Truncate here; padding is left to the collator.
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=512
    )

# Drop the raw text column so only tensor-convertible keys remain.
tokenized_dataset = dataset.map(tokenize_fn, batched=True, remove_columns=['text'])

collator = DataCollatorWithPadding(tokenizer=tokenizer)
# The collator expects a list of per-sample dicts.
batch = collator([tokenized_dataset[0], tokenized_dataset[1], tokenized_dataset[2]])

print('Keys in batch:', batch.keys())
print('Input IDs shape:', batch['input_ids'].shape)
print('Attention mask shape:', batch['attention_mask'].shape)
print('\nInput IDs batch:')
print(batch['input_ids'])
print('\nAttention mask batch:')
print(batch['attention_mask'])

Output:
Keys in batch: dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
Input IDs shape: torch.Size([3, 10])
Attention mask shape: torch.Size([3, 10])

Input IDs batch:
tensor([[ 101, 7592, 2088,  102,    0,    0,    0,    0,    0,    0],
        [ 101, 2023, 2003, 1037, 2936, 6251, 2572, 2005, 3231,  102],
        [ 101, 2460,  102,    0,    0,    0,    0,    0,    0,    0]])

Attention mask batch:
tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])

What just happened?
The collator took three tokenized samples of different lengths (4, 10, and 3 tokens, counting the [CLS] and [SEP] special tokens the tokenizer added), found the longest (10 tokens), padded the other two to that length with pad token ID 0, and extended each attention_mask so that real tokens are 1 and padding is 0. The batch width is 10 because that is the longest sequence in this batch; nothing is padded out to max_length=512. All outputs are PyTorch tensors with the batch dimension first. Exact token IDs depend on the tokenizer's vocabulary, so rerun the snippet to confirm the values on your setup.
Common gotcha
Developers often forget that DataCollatorWithPadding pads to the longest sequence in the current batch, not to a fixed max_length, so batch shapes change from step to step. A single very long sequence forces every other sample in its batch up to that length, and the resulting tensor can be large. The fix: pass truncation=True together with a reasonable max_length in your tokenizer call (before the collator sees the data) to cap how long any sequence can be.
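A quick back-of-the-envelope calculation shows how much one long sample costs. The sketch below uses made-up token counts and a hypothetical helper, `padded_cells`, that compares the number of real tokens in a batch with the number of cells in the padded tensor, with and without a truncation cap.

```python
def padded_cells(lengths, cap=None):
    """Return (real_tokens, tensor_cells) for one batch padded to its longest sample."""
    if cap is not None:
        # Truncation in the tokenizer call caps each sample before collation.
        lengths = [min(n, cap) for n in lengths]
    return sum(lengths), len(lengths) * max(lengths)


batch_lengths = [12, 15, 9, 480]  # hypothetical token counts for one batch

print(padded_cells(batch_lengths))           # (516, 1920)
print(padded_cells(batch_lengths, cap=128))  # (164, 512)
```

Without a cap, the batch holds 516 real tokens in 1920 cells, so most of the tensor is padding. Capping at 128 shrinks the tensor to 512 cells, at the price of truncating the long sample.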
Error recovery
TypeError: 'DataCollatorWithPadding' object is not callable
RuntimeError: expected scalar type Double but found Float
KeyError: 'input_ids'
Experienced dev note
DataCollatorWithPadding has no padding logic of its own: it delegates to the tokenizer's pad method, so the pad token ID and padding side come from your tokenizer's configuration. This is not new behavior; the collator has worked this way throughout the 4.x releases. The gotcha: if your tokenizer has no pad token (pad_token_id is None, common in models like GPT-2), padding cannot work, and the call typically raises a ValueError about the missing padding token instead of returning a batch. Always check `tokenizer.pad_token_id` before instantiating the collator. If it's None, set it explicitly with `tokenizer.pad_token = tokenizer.eos_token`, or add a dedicated pad token. This one line can save hours of debugging.
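The pad-token check can be wrapped in a small guard. This is a sketch only: `ensure_pad_token` is a hypothetical helper, not part of transformers, and the demo uses a stand-in object instead of a real tokenizer so that it runs without downloading a model. It assumes that reusing the EOS token as padding is acceptable for your model.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Give the tokenizer a pad token if it has none; return the token in use."""
    if tokenizer.pad_token is None:
        # Assumption: reusing EOS as the pad token is acceptable here.
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer.pad_token


# Stand-in for a GPT-2 style tokenizer that ships without a pad token.
fake_tokenizer = SimpleNamespace(pad_token=None, eos_token="<|endoftext|>")
print(ensure_pad_token(fake_tokenizer))  # <|endoftext|>
```

With a real tokenizer, call the guard right after `AutoTokenizer.from_pretrained(...)` and before `DataCollatorWithPadding(tokenizer=tokenizer)`; assigning `pad_token` on a real tokenizer also updates `pad_token_id`.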
Check your understanding
You have a training dataset where 90% of samples are 50 tokens and 10% are 512 tokens. You're using DataCollatorWithPadding. Why might this be inefficient, and what's the one-line fix in your tokenizer call?
Answer hint
The collator pads each batch to the longest sample in that batch, so a batch containing even one 512-token sample forces every other sample in it to be padded to 512. With 10% long samples and shuffled batches, most batches will contain at least one. The fix is to truncate long sequences before they reach the collator by setting max_length in the tokenizer() call, e.g., tokenizer(..., max_length=128, truncation=True). Keep in mind that truncation discards the tail of the long samples.
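The "most batches" claim can be checked with a short calculation. The sketch below assumes batches are drawn by random shuffling, so each sample is long with probability 0.10 independently of the others; `p_batch_has_long` is a hypothetical helper and the batch sizes are illustrative.

```python
def p_batch_has_long(batch_size, long_fraction=0.10):
    """Probability that a randomly drawn batch contains at least one long sample."""
    return 1 - (1 - long_fraction) ** batch_size


for batch_size in (8, 32, 50):
    print(batch_size, round(p_batch_has_long(batch_size), 3))
```

Under these assumptions, roughly 57% of batches of size 8 and about 97% of batches of size 32 contain at least one 512-token sample, so nearly every batch ends up padded to 512 unless the long samples are truncated.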