Code Intermediate medium · 6 min

DataLoader: batching and shuffling

What you will learn

DataLoader automatically groups your data into batches and randomizes order to train neural networks efficiently.

Why this matters

Training on one sample at a time is wasteful; batching leverages GPU parallelism. Shuffling keeps consecutive batches from being correlated by how the data happens to be stored (for example, sorted by class), so the model does not pick up on artifacts of dataset order.

Skip if: Don't shuffle if you're doing time-series forecasting where temporal order matters, or on validation/test sets where you want batches in the same order on every run. Don't batch if you're doing inference on a single sample or have a tiny dataset that can be processed in one full pass per epoch.

Explanation

DataLoader is a PyTorch utility that wraps your dataset and handles two critical training logistics: batching (grouping multiple samples together) and shuffling (randomizing the order each epoch). It is not a data container: it is an iterable that uses a sampler to decide which indices to fetch from your dataset, and in what order.

Mechanically, for a map-style Dataset, DataLoader generates the indices [0, 1, 2, ..., N-1], optionally shuffles them, then splits them into chunks of size batch_size (the last chunk is smaller when N is not divisible by batch_size, unless drop_last=True). Each time you iterate over the DataLoader, it fetches the samples for one chunk, collates them into batch tensors, and yields the result. When shuffle=True, it draws a new permutation at the start of each epoch; otherwise the model would see the same sequence order repeatedly.
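The index handling described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not DataLoader's actual implementation (which uses sampler classes and PyTorch's own random number generator); batched_indices and its parameters are hypothetical names chosen for this example.

```python
import random


def batched_indices(n, batch_size, shuffle=True, drop_last=False, seed=None):
    """Yield lists of dataset indices, one list per batch (illustrative helper)."""
    indices = list(range(n))
    if shuffle:
        # A fresh permutation of the indices; the dataset itself is untouched.
        random.Random(seed).shuffle(indices)
    for start in range(0, n, batch_size):
        batch = indices[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            break  # skip the incomplete final batch
        yield batch


batches = list(batched_indices(n=10, batch_size=4, seed=0))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Every index appears exactly once per pass; only the grouping and order change between passes.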

Use batching for essentially all deep learning training. Use shuffling for training splits; disable it for validation and test sets so batches arrive in the same order on every run, which keeps results easy to compare and debug. Set num_workers > 0 on multi-core systems to load data in worker processes while the GPU trains.
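A minimal sketch of the usual split between a shuffled training loader and an unshuffled validation loader. The 80/20 split, the batch size of 16, and the random data are assumptions made for illustration; num_workers is left at 0 so the snippet runs anywhere without a main-module guard.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy data (assumed for illustration): 100 samples, 10 features, binary labels.
full = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
train_set, val_set = random_split(
    full, [80, 20], generator=torch.Generator().manual_seed(0)
)

# Shuffle only the training split; keep validation order fixed.
train_loader = DataLoader(train_set, batch_size=16, shuffle=True, num_workers=0)
val_loader = DataLoader(val_set, batch_size=16, shuffle=False, num_workers=0)

# Two passes over the validation loader yield the labels in the same order.
val_labels_a = torch.cat([y for _, y in val_loader])
val_labels_b = torch.cat([y for _, y in val_loader])
print(torch.equal(val_labels_a, val_labels_b))  # → True
print(len(train_loader), len(val_loader))  # → 5 2
```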

Analogy

Think of DataLoader as a waiter in a restaurant. Your dataset is the kitchen with all ingredients. The waiter (DataLoader) doesn't reorganize the kitchen: instead, each time service starts (each epoch), the waiter shuffles the order of tables to serve, grabs 4 tables worth of orders at a time (batch_size=4), and brings that group to the chef (GPU) in parallel. The chef works much faster on 4 orders at once than 1 order at a time.

Code

python

import torch
from torch.utils.data import Dataset, DataLoader

class SimpleDataset(Dataset):
    def __init__(self, size=100):
        self.data = torch.randn(size, 10)
        self.labels = torch.randint(0, 2, (size,))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

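# 20 samples with batch_size=4 gives 5 batches per epoch
dataset = SimpleDataset(size=20)

loader = DataLoader(
    dataset,
    batch_size=4,   # samples per batch
    shuffle=True,   # draw a new random order every epoch
    num_workers=0,  # load data in the main process
)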

print("Epoch 1:")
for batch_idx, (features, labels) in enumerate(loader):
    print(f"Batch {batch_idx}: features shape {features.shape}, labels {labels.tolist()}")

print("\nEpoch 2 (notice different order):")
for batch_idx, (features, labels) in enumerate(loader):
    print(f"Batch {batch_idx}: features shape {features.shape}, labels {labels.tolist()}")

Output

Epoch 1:
Batch 0: features shape torch.Size([4, 10]), labels [1, 0, 1, 1]
Batch 1: features shape torch.Size([4, 10]), labels [0, 1, 0, 0]
Batch 2: features shape torch.Size([4, 10]), labels [1, 1, 0, 1]
Batch 3: features shape torch.Size([4, 10]), labels [0, 0, 1, 0]
Batch 4: features shape torch.Size([4, 10]), labels [1, 0, 1, 1]

Epoch 2 (notice different order):
Batch 0: features shape torch.Size([4, 10]), labels [0, 1, 1, 0]
Batch 1: features shape torch.Size([4, 10]), labels [1, 1, 0, 1]
Batch 2: features shape torch.Size([4, 10]), labels [0, 0, 1, 1]
Batch 3: features shape torch.Size([4, 10]), labels [1, 0, 0, 1]
Batch 4: features shape torch.Size([4, 10]), labels [1, 0, 1, 0]

What just happened?

We created a 20-sample dataset and wrapped it in a DataLoader with batch_size=4 and shuffle=True. The DataLoader divided the 20 samples into 5 batches of 4. When we iterated twice, the same samples appeared in different batch groupings and orders: the label sequences differ between Epoch 1 and Epoch 2, which shows the indices were reshuffled. (The data is random and unseeded, so the exact labels you see will differ from the output above.) Each iteration stacked 4 samples (which __getitem__ returned individually) into a single batch tensor of shape [4, 10].

Common gotcha

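Setting shuffle=True on a validation or test DataLoader. Aggregate metrics are usually unaffected, but batches arrive in a different order on every run, which makes per-batch debugging and comparisons harder. Also: if you have class imbalance and small batches, shuffling can produce batches of mostly one class by chance. WeightedRandomSampler can rebalance classes by drawing rare classes more often (it samples by weight, which is not strictly stratified sampling); pass it via sampler= and leave shuffle unset, because the two options are mutually exclusive.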
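A minimal sketch of rebalancing an imbalanced dataset with WeightedRandomSampler. The 90/10 class split and the inverse-frequency weights are assumptions chosen for illustration; other weighting schemes are possible.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced data (assumed): 90 samples of class 0, 10 of class 1.
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
features = torch.randn(100, 10)
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),
    replacement=True,  # rare samples may be drawn more than once
    generator=torch.Generator().manual_seed(0),
)

# Pass the sampler instead of shuffle=True; the two cannot be combined.
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

drawn = torch.cat([y for _, y in loader])
print(f"fraction of class 1 drawn: {drawn.float().mean().item():.2f}")
```

With these weights each class is drawn with roughly equal probability overall, although any individual batch can still be lopsided.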

Error recovery

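RuntimeError: stack expects each tensor to be equal size, but got [...] at entry 0 and [...] at entry 1

Your __getitem__ returns tensors of inconsistent shapes (e.g., variable-length sequences), so the default collate function cannot stack them into one batch tensor. Pass a custom collate_fn that pads them to a common length or batches them in some other explicit way.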
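A minimal sketch of such a collate_fn, assuming 1-D sequences padded with zeros to the longest sequence in each batch. VariableLengthDataset and pad_collate are hypothetical names for this example; pad_sequence is the PyTorch helper from torch.nn.utils.rnn.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Toy dataset (hypothetical): sample i is a 1-D tensor of length i + 1."""

    def __init__(self, size=6):
        self.sequences = [torch.arange(i + 1, dtype=torch.float32) for i in range(size)]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]


def pad_collate(batch):
    # Record the true lengths, then pad to the longest sequence in this batch.
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths


loader = DataLoader(VariableLengthDataset(), batch_size=3, shuffle=False, collate_fn=pad_collate)
for padded, lengths in loader:
    print(padded.shape, lengths.tolist())
# → torch.Size([3, 3]) [1, 2, 3]
# → torch.Size([3, 6]) [4, 5, 6]
```

Returning the original lengths alongside the padded tensor lets downstream code mask out the padding.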

RuntimeError: Too many open files

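With num_workers > 0, workers hand tensors to the main process through shared memory, and each shared tensor can keep a file descriptor open; with many workers or many small tensors you hit the OS limit on open files (not on processes). Reduce num_workers, raise the limit with 'ulimit -n 4096', or change the sharing strategy with torch.multiprocessing.set_sharing_strategy('file_system').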

ValueError: sampler option is mutually exclusive with shuffle

You passed a custom sampler together with shuffle=True. num_workers is unrelated: num_workers=0 works fine with shuffle=True. Either drop the sampler and let shuffle=True create the default random sampler, or keep the sampler and leave shuffle unset so the sampler alone controls the order.

IndexError in __getitem__

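DataLoader only requests indices in [0, len(dataset) - 1], so this usually means __len__ reports more samples than __getitem__ can serve, for example because the underlying data was filtered or modified after __init__. Keep __len__ in sync with the actual storage and do not change the dataset size during iteration.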

Experienced dev note

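A common habit is to set num_workers to the number of CPU cores and call it a day. In practice a smaller value (often 2–4) is a reasonable starting point, because each worker is a separate process with its own startup, memory, and coordination overhead; measure throughput before going higher. Also: shuffling is reproducible if you seed it. Call torch.manual_seed(42) before you start iterating, or pass generator=torch.Generator().manual_seed(42) to the DataLoader so its order does not depend on other random calls in your program. This is valuable when debugging flaky training runs. Use pin_memory=True if you move batches to a GPU; it places each batch in page-locked (pinned) host memory, which speeds up the CPU-to-GPU copy.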
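One way to make the shuffle order reproducible is to hand the DataLoader its own seeded generator. The sketch below assumes a toy dataset of the integers 0 to 19; first_epoch_order is a hypothetical helper used only to compare two runs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(20))


def first_epoch_order(seed):
    """Return the sample order of the first epoch for a given seed (illustrative)."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=g)
    return torch.cat([batch[0] for batch in loader]).tolist()


run_a = first_epoch_order(42)
run_b = first_epoch_order(42)
print(run_a == run_b)  # → True
```

A dedicated generator keeps the loader's order independent of any other code that consumes PyTorch's global random state.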

Check your understanding

If you create two DataLoaders with the same Dataset and shuffle=True, why will the batches be in a different order even on the first epoch if you don't set a random seed?

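Answer hint

shuffle=True uses a RandomSampler, which draws its permutation from PyTorch's random number generator (the global one seeded by torch.manual_seed, or a generator passed to the DataLoader), not from Python's random module. Without an explicit seed, the global generator starts from a nondeterministic state, and every time a loader is iterated it consumes fresh random numbers, so two loaders produce different permutations. A correct answer mentions the unseeded (or shared) RNG state and that a new permutation is drawn on each iteration of a DataLoader.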

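VERSION PyTorch 2.11.x: The core DataLoader arguments (dataset, batch_size, shuffle, num_workers, collate_fn) have been stable since 1.0; later releases added options such as generator, persistent_workers, and prefetch_factor. For reproducible shuffling, pass a CPU torch.Generator through the generator argument (index shuffling happens on the CPU, so a CUDA generator is not what you want here). In distributed training, use DistributedSampler and call set_epoch(epoch) at the start of each epoch so every process shuffles consistently. With pin_memory=True, have your Dataset return CPU tensors, since pinning applies to host memory.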
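For distributed training, one common approach to consistent shuffling is DistributedSampler with set_epoch. The sketch below passes num_replicas and rank explicitly so it runs in a single process without initializing a process group; the two "ranks" are simulated, and indices_for is a hypothetical helper.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))


def indices_for(rank, epoch):
    """Indices one simulated rank would load in a given epoch (illustrative)."""
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=True, seed=0)
    sampler.set_epoch(epoch)  # the permutation depends on seed and epoch
    return list(sampler)


rank0 = indices_for(0, epoch=0)
rank1 = indices_for(1, epoch=0)
print(sorted(rank0 + rank1))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

Each rank gets a disjoint share of the same permutation, and calling set_epoch with a new value changes that permutation for all ranks together.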

Dive into custom collate functions to handle variable-length sequences and non-standard batch formats; they are essential for NLP and time-series tasks where simple stacking doesn't work.
