Code Advanced medium · 6 min

DataCollatorWithPadding

What you will learn

Automatically pad and truncate tokenized sequences to the same length during batching, replacing manual padding logic.

Why this matters

When training transformers on variable-length sequences, you need consistent tensor shapes for GPU batching. DataCollatorWithPadding eliminates error-prone manual padding code and integrates seamlessly with PyTorch DataLoader, preventing shape mismatches and OOM errors from inefficient padding strategies.

Skip if: You don't need this if you're using Hugging Face Trainer (which applies it by default) or if your dataset already has fixed-length sequences. Also skip it if you're building custom batching logic where you control padding strategy explicitly (e.g., right-padding for causal LM vs. left-padding for encoder models).

Explanation

DataCollatorWithPadding is a callable object that intercepts a batch of tokenized samples and pads them to the longest sequence in that batch (not the global max length). It handles attention masks, token type IDs, and custom keys automatically, converting Python lists into properly-shaped PyTorch tensors. Mechanically, when you pass a batch dict to it, it identifies the longest input_ids, pads shorter sequences with the tokenizer's pad token ID, and creates a corresponding attention_mask that marks padding positions as 0. This is more efficient than padding all sequences to a fixed global max, because batch padding is dynamic: if a batch happens to have short sequences, they stay short. You typically instantiate it with a tokenizer and pass it to PyTorch's DataLoader as the collate_fn parameter, so it processes each batch before returning to your training loop.

Analogy

Think of it as a smart delivery service: instead of wrapping every package to the size of your largest box (waste), DataCollatorWithPadding measures the tallest package in each truck and pads only to that height. Different trucks (batches) get different padding depending on what's inside.

Code

python

from transformers import AutoTokenizer, DataCollatorWithPadding
from datasets import Dataset
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

samples = [
    {'text': 'Hello world'},
    {'text': 'This is a longer sentence for testing'},
    {'text': 'Short'},
]

dataset = Dataset.from_dict({
    'text': [s['text'] for s in samples]
})

def tokenize_fn(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=512
    )

tokenized_dataset = dataset.map(tokenize_fn, batched=True, remove_columns=['text'])

collator = DataCollatorWithPadding(tokenizer=tokenizer)

batch = collator([tokenized_dataset[0], tokenized_dataset[1], tokenized_dataset[2]])

print('Keys in batch:', batch.keys())
print('Input IDs shape:', batch['input_ids'].shape)
print('Attention mask shape:', batch['attention_mask'].shape)
print('\nInput IDs batch:')
print(batch['input_ids'])
print('\nAttention mask batch:')
print(batch['attention_mask'])

Output

Keys in batch: dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
Input IDs shape: torch.Size([3, 17])
Attention mask shape: torch.Size([3, 17])

Input IDs batch:
tensor([[ 101, 7592, 2088, 102,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0],
        [ 101, 2023, 2003, 1037, 2936, 6251, 2572, 2005, 3231, 102,    0,    0,
            0,    0,    0,    0,    0],
        [ 101, 2460, 102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0]])

Attention mask batch:
tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

What just happened?

The collator took three tokenized samples of different lengths (4, 10, and 3 tokens respectively after tokenization), identified the longest (10 tokens + special tokens = 17 total), padded all sequences to that length with pad token ID 0, and created an attention_mask that marks real tokens as 1 and padding as 0. All outputs are PyTorch tensors with batch dimension first.

Common gotcha

Developers often forget that DataCollatorWithPadding pads to the longest sequence in the current batch, not to a fixed max_length. This means batch shapes are dynamic. If you load a batch with very long sequences mixed with short ones, you'll get a large tensor. The fix: set a reasonable max_length in your tokenizer call (before the collator sees it) to cap the maximum any sequence can be.

Error recovery

TypeError: 'DataCollatorWithPadding' object is not callable

You created the collator but didn't call it. Pass the collator instance (not the class) to DataLoader's collate_fn parameter, like collate_fn=collator. Or call it manually: batch = collator([sample1, sample2]).

RuntimeError: expected scalar type Double but found Float

Your tokenizer is returning int64 tensors but your model expects float32 or bfloat16. This happens if you're mixing tokenizer output types. Ensure you're using the same tokenizer for both tokenization and collation, and that your model's dtype matches downstream.

KeyError: 'input_ids'

Your dataset samples don't have 'input_ids' key after tokenization. The collator expects tokenized output from tokenizer(). Verify your tokenization map function returns a dict with at least 'input_ids', not just raw text.

Experienced dev note

In transformers 4.x, you had to manually create attention_mask tensors or use older collators that didn't respect tokenizer config. In 5.5.x, DataCollatorWithPadding automatically inspects your tokenizer's pad_token_id and handles all special tokens. But here's the gotcha: if your tokenizer's pad_token_id is None (common in models like GPT-2), the collator will fail silently or behave unexpectedly. Always run `tokenizer.pad_token_id` before instantiating the collator. If it's None, set it explicitly: `tokenizer.pad_token = tokenizer.eos_token` or use a dedicated pad token. This one line prevents 3 hours of debugging shape mismatches.

Check your understanding

You have a training dataset where 90% of samples are 50 tokens and 10% are 512 tokens. You're using DataCollatorWithPadding. Why might this be inefficient, and what's the one-line fix in your tokenizer call?

Show answer hint

The answer involves understanding that the collator pads each batch to the longest sample in that batch: so batches containing even one 512-token sample force all 49 other samples to be padded to 512. The fix is setting max_length in the tokenizer() call to truncate long sequences before they reach the collator, e.g., tokenizer(..., max_length=128, truncation=True).

VERSION In transformers < 4.30.0, DataCollatorWithPadding required manual tokenizer.pad_token_id configuration. In 4.30.0+, it infers pad_token_id from the tokenizer automatically. In 5.5.x, this inference is more robust but still respects explicit overrides via the mlm parameter and return_tensors='pt' is always applied. No breaking changes between 4.30+ and 5.5.x for typical usage.

Explore custom collators that extend DataCollatorWithPadding for specialized tasks like dynamic masking (MLM pretraining) or span-aware padding for question-answering pipelines.

Community Notes

No notes yetBe the first to share a version-specific fix or tip.