Custom Dataset: loading from disk
Why this matters
Real-world data lives on disk, not in RAM. You need to load it efficiently without consuming all memory. A custom Dataset lets you implement lazy loading: files are read only when needed, not all at once.
Explanation
What it is: A custom Dataset is a Python class that inherits from torch.utils.data.Dataset and defines how to load individual samples from disk. Written this way, it is lazy: it reads data only when __getitem__ is called, not at initialization.
How it works: You override two methods: __len__ returns the total number of samples, and __getitem__ takes an index and returns the sample at that index (here, an image and its label). When you pass the dataset to a DataLoader with num_workers > 0, PyTorch starts background worker processes that call __getitem__ in parallel, prefetching batches while your GPU trains. Each worker loads its own batches independently, but all workers share the same disk, so adding workers helps only until storage bandwidth becomes the bottleneck.
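To make the "load on demand" behavior concrete, here is a minimal sketch. CountingDataset is a hypothetical stand-in that is not part of the lesson's code: instead of reading a file, it increments a counter each time __getitem__ runs. It uses num_workers=0 so the counter lives in the same process as the loop.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class CountingDataset(Dataset):
    """Hypothetical stand-in for a disk-backed dataset: counts simulated reads."""

    def __init__(self, num_samples):
        self.num_samples = num_samples
        self.reads = 0  # incremented every time a sample is "loaded"

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        self.reads += 1  # a real dataset would open a file here
        return torch.full((3,), float(idx)), idx % 2


dataset = CountingDataset(num_samples=8)
reads_after_init = dataset.reads  # nothing has been loaded yet

loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)
first_images, first_labels = next(iter(loader))
reads_after_one_batch = dataset.reads  # one batch worth of samples

print(reads_after_init, reads_after_one_batch, tuple(first_images.shape))
```

With num_workers > 0, each worker process gets its own copy of the dataset, so a counter like this would stay at 0 in the main process. That is a side effect of the sketch, not a sign that nothing was loaded.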
When to use it: Use this pattern whenever your dataset is too large to fit in memory, or when loading is expensive (image resizing, decompression). It's the standard for image classification, object detection, and any real-world deep learning project.
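A quick back-of-the-envelope calculation shows why "too large to fit in memory" arrives sooner than you might expect. The numbers below are assumptions chosen for illustration (100,000 RGB images at 224x224, stored as float32, batch size 32); substitute your own.

```python
def dataset_bytes(num_images, height, width, channels=3, bytes_per_value=4):
    """Bytes needed to hold decoded images as float32 arrays (4 bytes per value)."""
    return num_images * height * width * channels * bytes_per_value


# Assumed sizes for illustration only.
full_dataset = dataset_bytes(100_000, 224, 224)
one_batch = dataset_bytes(32, 224, 224)

print(f'Whole dataset decoded: {full_dataset / 1e9:.2f} GB')
print(f'One batch of 32:       {one_batch / 1e6:.2f} MB')
```

Under these assumptions the decoded dataset needs about 60 GB, while a single batch needs about 19 MB, which is the gap lazy loading exploits.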
Analogy
Think of it like a library catalog. The catalog (Dataset) knows which books exist and their locations, but doesn't load every book into memory. The librarian (DataLoader with workers) fetches books on demand: if you ask for book #42, they go get it. Multiple librarians working in parallel keep books flowing to you without you waiting.
Code
import os
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import numpy as np


class DiskImageDataset(Dataset):
    def __init__(self, image_dir, label_file):
        """
        Args:
            image_dir: path to folder containing image files
            label_file: path to .txt file with lines: "filename label"
        """
        self.image_dir = image_dir
        self.image_paths = []
        self.labels = []
        # Store only paths and labels here; no image is opened yet.
        with open(label_file, 'r') as f:
            for line in f:
                parts = line.strip().split()
                filename, label = parts[0], int(parts[1])
                self.image_paths.append(os.path.join(image_dir, filename))
                self.labels.append(label)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        label = self.labels[idx]
        # The file is read here, on demand.
        image = Image.open(img_path).convert('RGB')
        # Scale pixel values to [0, 1], then reorder HWC -> CHW.
        image_array = np.array(image, dtype=np.float32) / 255.0
        image_tensor = torch.from_numpy(image_array).permute(2, 0, 1)
        return image_tensor, label


if __name__ == '__main__':
    # Create two random sample images and a label file to load from.
    os.makedirs('sample_images', exist_ok=True)
    os.makedirs('sample_data', exist_ok=True)
    img_array = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
    img = Image.fromarray(img_array)
    img.save('sample_images/cat_001.jpg')
    img_array = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
    img = Image.fromarray(img_array)
    img.save('sample_images/dog_001.jpg')
    with open('sample_data/labels.txt', 'w') as f:
        f.write('cat_001.jpg 0\n')
        f.write('dog_001.jpg 1\n')

    dataset = DiskImageDataset('sample_images', 'sample_data/labels.txt')
    print(f'Dataset size: {len(dataset)}')
    image_tensor, label = dataset[0]
    print(f'Image shape: {image_tensor.shape}')
    print(f'Label: {label}')
    print(f'Image min: {image_tensor.min():.2f}, max: {image_tensor.max():.2f}')

    # shuffle=True, so the label order within the batch may vary between runs.
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)
    for batch_images, batch_labels in dataloader:
        print(f'Batch images shape: {batch_images.shape}')
        print(f'Batch labels: {batch_labels}')
        break

Output
Dataset size: 2
Image shape: torch.Size([3, 32, 32])
Label: 0
Image min: 0.00, max: 1.00
Batch images shape: torch.Size([2, 3, 32, 32])
Batch labels: tensor([0, 1])
What just happened?
You created a Dataset subclass that stores paths and labels in memory but loads actual images only when indexed. When you called dataset[0], __getitem__ read the image file from disk, scaled pixel values to [0, 1], and converted the result to a tensor in channels-first layout. When you passed the dataset to DataLoader, it batched two samples and returned a tensor of shape (2, 3, 32, 32). The file I/O happened on demand, not at initialization.
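Batching worked here because both sample images have the same size. The sketch below shows what happens when they don't, and one possible fix: resizing before the tensor conversion. The to_tensor helper and the 32x32 target size are assumptions for illustration, not part of the lesson's code.

```python
import numpy as np
import torch
from PIL import Image


def to_tensor(image, size=None):
    """Convert a PIL image to a float CHW tensor, optionally resizing first."""
    if size is not None:
        image = image.resize(size)  # size is (width, height)
    array = np.array(image.convert('RGB'), dtype=np.float32) / 255.0
    return torch.from_numpy(array).permute(2, 0, 1)


small = Image.new('RGB', (32, 32))
large = Image.new('RGB', (48, 40))  # (width, height)

# Stacking tensors of different sizes fails, which is what the DataLoader's
# default batching runs into when images are not the same size.
try:
    torch.stack([to_tensor(small), to_tensor(large)])
    stacked_without_resize = True
except RuntimeError:
    stacked_without_resize = False

# Resizing every image to a common size makes batching possible.
batch = torch.stack([to_tensor(small, (32, 32)), to_tensor(large, (32, 32))])
print(stacked_without_resize, tuple(batch.shape))
```

In a real Dataset the resize would go inside __getitem__, right after Image.open().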
Common gotcha
Forgetting to call .convert('RGB') when opening images. Image.open() preserves the file's mode, so an RGBA image yields an array with 4 channels and a grayscale image yields a 2-D array with no channel axis. For RGBA, .permute(2, 0, 1) silently produces a 4-channel tensor; for grayscale, it raises an error because there is no third dimension. Always convert to the channel layout your model expects.
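You can see the effect of the image mode on array shape without any files on disk. This small sketch builds blank grayscale and RGBA images in memory (the 4x4 size is arbitrary) and compares the array shapes before and after .convert('RGB').

```python
import numpy as np
from PIL import Image

gray = Image.new('L', (4, 4))      # grayscale: one value per pixel
rgba = Image.new('RGBA', (4, 4))   # RGB plus an alpha channel

# Array shape for each mode as opened, then after converting to RGB.
shapes_raw = {img.mode: np.array(img).shape for img in (gray, rgba)}
shapes_rgb = {img.mode: np.array(img.convert('RGB')).shape for img in (gray, rgba)}

print(shapes_raw)  # {'L': (4, 4), 'RGBA': (4, 4, 4)}
print(shapes_rgb)  # {'L': (4, 4, 3), 'RGBA': (4, 4, 3)}
```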
Error recovery
FileNotFoundError
IndexError in line 'parts = line.strip().split()'
RuntimeError: stack expects each tensor to be equal size
PermissionError when opening image
Experienced dev note
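A common mistake is opening every image with Image.open() in __init__ to check that the files exist. Image.open() reads only the file header, but keeping the returned objects holds one open file handle per image, and anything that decodes the pixels (such as np.array(image)) pulls the whole dataset into RAM and slows startup. Instead, check cheaply: call os.path.exists() on each path at init, and don't open images until __getitem__. Also, when using num_workers > 0, keep your Dataset picklable (serializable): with the spawn start method (the default on Windows and macOS) the dataset object is pickled and sent to every worker, and open file handles cannot be pickled. Storing only paths and labels in __init__ and doing file I/O in __getitem__ avoids the problem; it also keeps forked workers on Linux from sharing a file handle.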
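Here is a minimal sketch of both points, using only the standard library. The find_missing helper and the placeholder file names are hypothetical. It checks paths without opening any image, then shows that a path string survives pickling while an open file handle does not.

```python
import os
import pickle
import tempfile


def find_missing(paths):
    """Return the paths that do not exist, without opening any file."""
    return [p for p in paths if not os.path.exists(p)]


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, 'cat_001.jpg')
    absent = os.path.join(tmp, 'dog_001.jpg')
    open(present, 'wb').close()  # empty placeholder file; never decoded

    # Cheap existence check of the kind you might run in __init__.
    missing = find_missing([present, absent])

    # A path string is picklable; an open file handle is not.
    path_picklable = pickle.loads(pickle.dumps(present)) == present
    with open(present, 'rb') as handle:
        try:
            pickle.dumps(handle)
            handle_picklable = True
        except TypeError:
            handle_picklable = False

print(len(missing), path_picklable, handle_picklable)
```

Raising a single error at init that lists the missing paths tends to be easier to debug than a FileNotFoundError surfacing from a worker process mid-epoch.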
Check your understanding
You create a Dataset with 10,000 images. Your laptop has 8 GB of RAM. If you load all images into a list in __init__, what happens? Now explain how lazy loading (this pattern) solves the problem, and describe when a worker process reads a file from disk relative to when the main process calls .backward().
Answer hint
A correct answer mentions that pre-loading can exhaust RAM before training starts, depending on image size. Lazy loading reads files only in __getitem__, so only the batches currently in flight are in memory: roughly batch_size * num_workers * prefetch_factor images (prefetch_factor defaults to 2 when num_workers > 0), plus the batch being trained on. With num_workers > 0, workers read files while the main process runs the forward and backward passes, so I/O and computation overlap.