DataLoader: batching and shuffling
Why this matters
Training on one sample at a time is wasteful; batching leverages GPU parallelism. Shuffling prevents your model from memorizing dataset order and overfitting to artifacts of how the data happens to be organized.
Explanation
DataLoader is a PyTorch utility that wraps your dataset and handles two critical training logistics: batching (grouping multiple samples together) and shuffling (randomizing the order each epoch). It's not a data container: it's an iterable that uses a sampler to decide which indices to pull from your dataset, and in what order.
Mechanically, for a map-style Dataset, DataLoader builds the indices [0, 1, 2, ..., N-1], optionally shuffles them, then splits them into chunks of size batch_size (the last chunk is smaller when N is not a multiple of batch_size, unless drop_last=True). On each step of iteration it fetches the samples for one chunk, collates them into a single batch tensor, and yields it. When shuffle=True, it draws a new permutation of the indices at the start of each epoch: critical because the model would otherwise see the same sequence order repeatedly.
Use batching for any deep learning task (all modern training). Use shuffling for training splits; disable it for validation and test sets so results are reproducible and fair. Set num_workers > 0 on multi-core systems to load data in parallel while the GPU trains.
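The index, shuffle, chunk, and collate steps described above can be sketched by hand. This is a simplified illustration, not PyTorch's actual implementation; manual_batches is a hypothetical helper name, and it assumes every sample is a (features, label) pair of tensors.

```python
import torch

def manual_batches(data, labels, batch_size, shuffle, generator=None):
    """Yield (features, labels) batches, roughly mimicking a map-style DataLoader."""
    n = len(data)
    # Step 1: build the index list, shuffled or sequential.
    if shuffle:
        indices = torch.randperm(n, generator=generator).tolist()
    else:
        indices = list(range(n))
    # Step 2: split the indices into chunks of batch_size (the last may be smaller).
    for start in range(0, n, batch_size):
        chunk = indices[start:start + batch_size]
        # Step 3: fetch each sample individually, then stack ("collate") them.
        samples = [(data[i], labels[i]) for i in chunk]
        features = torch.stack([s[0] for s in samples])
        targets = torch.stack([s[1] for s in samples])
        yield features, targets

data = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # 10 samples, 1 feature each
labels = torch.arange(10)                                  # label i identifies sample i

batches = list(manual_batches(data, labels, batch_size=4, shuffle=True))
batch_sizes = [f.shape[0] for f, _ in batches]
seen = sorted(int(t) for _, ts in batches for t in ts)

print(batch_sizes)              # [4, 4, 2]
print(seen == list(range(10)))  # True: every sample appears exactly once
```

Calling manual_batches again draws a new permutation, which corresponds to the per-epoch reshuffle.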
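As a minimal sketch of that split between training and evaluation loaders (the dataset sizes and variable names here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for real training and validation splits.
train_ds = TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,)))
val_ds = TensorDataset(torch.arange(8, dtype=torch.float32).unsqueeze(1), torch.arange(8))

# Training: shuffle every epoch. num_workers=0 keeps this sketch single-process;
# for real training you would typically raise it (on platforms that spawn worker
# processes, that requires an `if __name__ == "__main__":` guard).
train_loader = DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=0)

# Validation: fixed order, so every pass sees identical batches.
val_loader = DataLoader(val_ds, batch_size=4, shuffle=False, num_workers=0)

val_order_1 = [int(y) for _, ys in val_loader for y in ys]
val_order_2 = [int(y) for _, ys in val_loader for y in ys]
print(val_order_1 == val_order_2)  # True: same order on every pass
print(len(train_loader))           # 4 batches of 8
```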
Analogy
Think of DataLoader as a waiter in a restaurant. Your dataset is the kitchen with all ingredients. The waiter (DataLoader) doesn't reorganize the kitchen: instead, each time service starts (each epoch), the waiter shuffles the order of tables to serve, grabs 4 tables worth of orders at a time (batch_size=4), and brings that group to the chef (GPU) in parallel. The chef works much faster on 4 orders at once than 1 order at a time.
Code
import torch
from torch.utils.data import Dataset, DataLoader


class SimpleDataset(Dataset):
    """Toy dataset: `size` random 10-feature samples with binary labels."""

    def __init__(self, size=100):
        self.data = torch.randn(size, 10)
        self.labels = torch.randint(0, 2, (size,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return one (features, label) pair; DataLoader stacks these into batches.
        return self.data[idx], self.labels[idx]


dataset = SimpleDataset(size=20)
loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,   # reshuffle at the start of every epoch
    num_workers=0,  # load data in the main process
)

print("Epoch 1:")
for batch_idx, (features, labels) in enumerate(loader):
    print(f"Batch {batch_idx}: features shape {features.shape}, labels {labels.tolist()}")

print("\nEpoch 2 (notice different order):")
for batch_idx, (features, labels) in enumerate(loader):
    print(f"Batch {batch_idx}: features shape {features.shape}, labels {labels.tolist()}")

Sample output (labels vary between runs because the data is random):
Epoch 1:
Batch 0: features shape torch.Size([4, 10]), labels [1, 0, 1, 1]
Batch 1: features shape torch.Size([4, 10]), labels [0, 1, 0, 0]
Batch 2: features shape torch.Size([4, 10]), labels [1, 1, 0, 1]
Batch 3: features shape torch.Size([4, 10]), labels [0, 0, 1, 0]
Batch 4: features shape torch.Size([4, 10]), labels [1, 0, 1, 1]

Epoch 2 (notice different order):
Batch 0: features shape torch.Size([4, 10]), labels [0, 1, 1, 0]
Batch 1: features shape torch.Size([4, 10]), labels [1, 1, 0, 1]
Batch 2: features shape torch.Size([4, 10]), labels [0, 0, 1, 1]
Batch 3: features shape torch.Size([4, 10]), labels [1, 0, 0, 1]
Batch 4: features shape torch.Size([4, 10]), labels [1, 0, 1, 0]
What just happened?
We created a 20-sample dataset and wrapped it in a DataLoader with batch_size=4 and shuffle=True. The DataLoader divided the 20 samples into 5 batches of 4. When we iterated twice, the same samples appeared but in different batch groupings and orders: the label sequences in Epoch 1 and Epoch 2 differ, which is what a reshuffle looks like (with binary labels this is evidence rather than proof, since different samples can share a label). On each step, the 4 samples that __getitem__ returned individually were stacked into a single batch tensor of shape [4, 10].
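Binary labels only hint at the reshuffle. One way to observe it directly is to have the dataset return each sample's index. IndexedDataset below is a hypothetical variant of SimpleDataset written for this check.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class IndexedDataset(Dataset):
    """Hypothetical helper: returns each sample's index alongside its features."""

    def __init__(self, size=20):
        self.data = torch.randn(size, 10)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return idx, self.data[idx]

loader = DataLoader(IndexedDataset(20), batch_size=4, shuffle=True)

epoch_orders = []
for epoch in range(2):
    # Flatten the index tensors of all batches into one visiting order.
    order = [int(i) for idxs, _ in loader for i in idxs]
    epoch_orders.append(order)
    print(f"Epoch {epoch + 1} order: {order}")
```

Each epoch should list all 20 indices exactly once, in an order that almost certainly differs between the two epochs.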
Common gotcha
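Setting shuffle=True on a validation or test DataLoader. Shuffling there buys nothing, since evaluation does not learn from sample order, and it makes per-batch results harder to reproduce and debug. Also: if you have class imbalance and small batches, shuffling might create batches of mostly one class by chance. WeightedRandomSampler can rebalance classes by drawing rarer samples more often (pass it via sampler= and leave shuffle off, since the two are mutually exclusive). Note that it balances batches only on average; strictly stratified batches need a custom batch sampler.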
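A minimal sketch of class rebalancing with WeightedRandomSampler follows. The 90/10 class split and the inverse-frequency weighting are illustrative assumptions, not a recommendation for any particular dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Imbalanced toy data: 90 samples of class 0, 10 samples of class 1.
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
features = torch.randn(100, 10)
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class.
class_counts = torch.bincount(labels)                 # tensor([90, 10])
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),
    replacement=True,  # minority samples are drawn repeatedly
    generator=torch.Generator().manual_seed(0),
)

# sampler= and shuffle=True are mutually exclusive, so shuffle stays at its default.
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

drawn = torch.cat([y for _, y in loader])
minority_fraction = drawn.float().mean().item()
print(f"Minority fraction after resampling: {minority_fraction:.2f}")
```

The minority class makes up 10% of the dataset but should land near half of the drawn samples. Individual batches can still be lopsided, because the balance holds only in expectation.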
Error recovery
ValueError: batch_size > 1 expected with default collate and this dataset return format
RuntimeError: Too many open files
NotImplementedError: shuffling with single_process_data_loading = True
IndexError in __getitem__
Experienced dev note
Many engineers set num_workers to the number of CPU cores and call it a day. In practice a small value, often 2–4, is a better starting point on most hardware: each worker is a separate process, and one per core spends resources on startup and coordination overhead, so benchmark before going higher. Also: shuffling is reproducible if you seed PyTorch's RNG with torch.manual_seed(42) before iterating, or, more robustly, pass a seeded torch.Generator through the DataLoader's generator argument: critical for debugging flaky training runs. Use pin_memory=True if moving data to GPU; it places batches in page-locked (pinned) host memory, which speeds up the CPU-to-GPU transfer.
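A short sketch of reproducible shuffling with a dedicated generator. first_epoch_order is a hypothetical helper, and pin_memory is enabled only when CUDA is available so the snippet also runs on CPU-only machines.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(20))

def first_epoch_order(seed):
    """Return the order in which samples are visited during the first epoch."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(
        dataset,
        batch_size=4,
        shuffle=True,
        generator=g,                           # shuffle draws from this generator
        pin_memory=torch.cuda.is_available(),  # pinned host memory only matters with a GPU
    )
    return [int(x) for (batch,) in loader for x in batch]

run_a = first_epoch_order(42)
run_b = first_epoch_order(42)
print(run_a == run_b)  # True: same seed, same shuffle order
```

Passing the generator explicitly keeps the shuffle order independent of whatever else consumes the global RNG between runs.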
Check your understanding
If you create two DataLoaders with the same Dataset and shuffle=True, why will the batches be in a different order even on the first epoch if you don't set a random seed?
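Answer hint
The answer requires understanding that shuffle draws its permutation from PyTorch's random number generator (the one torch.manual_seed controls, not Python's random module), and that a fresh permutation is drawn every time a loader is iterated. Without a fixed seed, the global generator's state differs between runs and advances with every draw, so two loaders do not consume the same random numbers. A correct answer will mention the unseeded global RNG state, or that the order is re-drawn on each pass over a loader rather than fixed when the DataLoader is created.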