Code Intermediate medium · 7 min

Custom Dataset: loading from disk

What you will learn

Build a custom PyTorch Dataset subclass that loads image files from disk and serves them to a DataLoader.

Why this matters

Real-world data lives on disk, not in RAM. You need to load it efficiently without consuming all memory. A custom Dataset lets you implement lazy loading: files are read only when needed, not all at once.

Skip if: your data is small enough to fit in memory. If the entire dataset is under 1 GB and fits comfortably in RAM, load it once with NumPy or torch and wrap it in a TensorDataset. Also skip this if you're using a pre-built dataset from torchvision or Hugging Face: they already handle loading.
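For the small-data case, a minimal sketch of the TensorDataset route might look like the following. The array shapes and the names `features`, `targets`, `small_dataset`, and `small_loader` are invented for illustration and are not part of this lesson's dataset.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical in-memory data: 100 samples with 8 features each.
features = np.random.rand(100, 8).astype(np.float32)
targets = np.random.randint(0, 2, size=100)

# torch.from_numpy wraps the existing arrays without copying them.
small_dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(targets))
small_loader = DataLoader(small_dataset, batch_size=16, shuffle=True)

first_x, first_y = next(iter(small_loader))
print(first_x.shape, first_y.shape)
```

Everything is already in RAM here, so there is nothing to load lazily and no custom class is needed.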

Explanation

What it is: A custom Dataset is a Python class that inherits from torch.utils.data.Dataset and defines how to load individual samples from disk. The pattern used here is lazy: data is loaded only when __getitem__ is called, not at initialization.

How it works: You override two methods: __len__ returns the total number of samples, and __getitem__ takes an index and returns the sample at that index (here, an image tensor and its label). When you pass the dataset to a DataLoader with num_workers > 0, PyTorch starts background worker processes that call __getitem__ in parallel, prefetching batches while your GPU trains. Each worker prepares its own batches independently, but all workers still share the same disk, so on slow storage adding more workers eventually stops helping.
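To make the two-method protocol concrete, here is a small sketch that records which indices the DataLoader asks for. `SquaresDataset` and its `calls` list are hypothetical helpers for this illustration only, and `num_workers=0` is used so the calls happen in the main process where they can be inspected.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Toy dataset: sample i is (i squared, i). Records every index requested."""
    def __init__(self, n):
        self.n = n
        self.calls = []

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        self.calls.append(idx)  # remember which index the DataLoader asked for
        return torch.tensor(float(idx * idx)), idx

squares = SquaresDataset(5)
loader = DataLoader(squares, batch_size=2, shuffle=False, num_workers=0)

# Each batch is built by calling __getitem__ once per index, then stacking.
batches = [labels.tolist() for _, labels in loader]
print(batches)        # [[0, 1], [2, 3], [4]]
print(squares.calls)  # [0, 1, 2, 3, 4]
```

The DataLoader uses __len__ to know how many indices exist and __getitem__ to fetch each one; with workers enabled, the same calls are simply made in other processes.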

When to use it: Use this pattern whenever your dataset is too large to fit in memory, or when loading is expensive (image resizing, decompression). It's the standard for image classification, object detection, and any real-world deep learning project.

Analogy

Think of it like a library catalog. The catalog (Dataset) knows which books exist and their locations, but doesn't load every book into memory. The librarian (DataLoader with workers) fetches books on demand: if you ask for book #42, they go get it. Multiple librarians working in parallel keep books flowing to you without you waiting.

Code

python

import os
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import numpy as np

class DiskImageDataset(Dataset):
    def __init__(self, image_dir, label_file):
        """
        Args:
            image_dir: path to folder containing image files
            label_file: path to .txt file with lines: "filename label"
        """
        self.image_dir = image_dir
        self.image_paths = []
        self.labels = []
        
        with open(label_file, 'r') as f:
            for line in f:
                parts = line.strip().split()
                filename, label = parts[0], int(parts[1])
                self.image_paths.append(os.path.join(image_dir, filename))
                self.labels.append(label)
    
    def __len__(self):
        return len(self.image_paths)
    
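    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        label = self.labels[idx]
        # Open lazily and release the file handle once the pixels are decoded.
        with Image.open(img_path) as img:
            image = img.convert('RGB')  # force 3 channels regardless of source mode
        # HWC uint8 in [0, 255] -> float32 in [0, 1]
        image_array = np.array(image, dtype=np.float32) / 255.0
        # PyTorch expects channels first: HWC -> CHW
        image_tensor = torch.from_numpy(image_array).permute(2, 0, 1)
        return image_tensor, label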

if __name__ == '__main__':
    os.makedirs('sample_images', exist_ok=True)
    os.makedirs('sample_data', exist_ok=True)
    
    img_array = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
    img = Image.fromarray(img_array)
    img.save('sample_images/cat_001.jpg')
    
    img_array = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
    img = Image.fromarray(img_array)
    img.save('sample_images/dog_001.jpg')
    
    with open('sample_data/labels.txt', 'w') as f:
        f.write('cat_001.jpg 0\n')
        f.write('dog_001.jpg 1\n')
    
    dataset = DiskImageDataset('sample_images', 'sample_data/labels.txt')
    print(f'Dataset size: {len(dataset)}')
    
    image_tensor, label = dataset[0]
    print(f'Image shape: {image_tensor.shape}')
    print(f'Label: {label}')
    print(f'Image min: {image_tensor.min():.2f}, max: {image_tensor.max():.2f}')
    
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)
    for batch_images, batch_labels in dataloader:
        print(f'Batch images shape: {batch_images.shape}')
        print(f'Batch labels: {batch_labels}')
        break

Output

Dataset size: 2
Image shape: torch.Size([3, 32, 32])
Label: 0
Image min: 0.00, max: 1.00
Batch images shape: torch.Size([2, 3, 32, 32])
Batch labels: tensor([0, 1])

What just happened?

You created a Dataset subclass that stores paths and labels in memory but loads actual images only when indexed. When you called <code>dataset[0]</code>, <code>__getitem__</code> read the image file from disk, converted it to a tensor, and normalized pixel values to [0, 1]. When you passed the dataset to DataLoader, it batched two samples and returned a tensor of shape (2, 3, 32, 32). The file I/O happened on demand, not at initialization.

Common gotcha

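Forgetting to call .convert('RGB') when opening images. If an image is RGBA or grayscale, Image.open() preserves that mode: np.array(image) then has 4 channels (RGBA) or no channel axis at all (grayscale, shape (H, W)). For grayscale, .permute(2, 0, 1) raises an error because the array has only two dimensions; for RGBA it silently yields a 4-channel tensor that later breaks batching or the model's first layer. Always convert to a fixed mode explicitly.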
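The effect of the image mode on array shape can be checked with a short sketch. It uses blank in-memory images created with Image.new rather than files from this lesson, and the 6x4 size is arbitrary.

```python
import numpy as np
from PIL import Image

# Blank synthetic images, width=6 and height=4, in three common modes.
gray = Image.new('L', (6, 4))
rgba = Image.new('RGBA', (6, 4))
rgb = Image.new('RGB', (6, 4))

shapes = {
    'L': np.array(gray).shape,      # (4, 6): no channel axis
    'RGBA': np.array(rgba).shape,   # (4, 6, 4): extra alpha channel
    'RGB': np.array(rgb).shape,     # (4, 6, 3)
    'L->RGB': np.array(gray.convert('RGB')).shape,      # (4, 6, 3)
    'RGBA->RGB': np.array(rgba.convert('RGB')).shape,   # (4, 6, 3)
}
for mode, shape in shapes.items():
    print(mode, shape)
```

After .convert('RGB'), every image has the same (H, W, 3) layout, which is what .permute(2, 0, 1) assumes.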

Error recovery

FileNotFoundError

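The label file path, or an image path built from it, doesn't exist. This often happens because a relative path is resolved against a different working directory than you expect. Print <code>os.path.abspath()</code> of the path to debug, and verify that every filename in the label file matches a file in the image directory.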

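IndexError in line 'filename, label = parts[0], int(parts[1])'

Label file has a line with fewer than 2 whitespace-separated values, usually an empty or malformed line. Add <code>if len(parts) >= 2:</code> before unpacking, or raise an error that names the offending line. Filenames that contain spaces cause a ValueError instead, because <code>parts[1]</code> is then a piece of the filename rather than the label.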
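One possible way to harden the label parsing is sketched below. `read_label_file` is a hypothetical helper, not part of the lesson's class, and the choice to skip blank lines and raise on malformed ones is an assumption about the behavior you want.

```python
import os
import tempfile

def read_label_file(label_file):
    """Parse 'filename label' lines, skipping blanks and rejecting bad lines."""
    entries = []
    with open(label_file, 'r') as f:
        for line_number, line in enumerate(f, start=1):
            stripped = line.strip()
            if not stripped:
                continue  # ignore blank lines
            # Split from the right so filenames containing spaces stay intact.
            parts = stripped.rsplit(maxsplit=1)
            if len(parts) != 2:
                raise ValueError(
                    f'{label_file}:{line_number}: expected "filename label", got {stripped!r}'
                )
            filename, label_text = parts
            entries.append((filename, int(label_text)))
    return entries

# Demo on a throwaway file with a blank line and a filename containing a space.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'labels.txt')
    with open(demo_path, 'w') as f:
        f.write('cat_001.jpg 0\n\nmy dog.jpg 1\n')
    parsed = read_label_file(demo_path)

print(parsed)  # [('cat_001.jpg', 0), ('my dog.jpg', 1)]
```

Raising with the file name and line number makes a bad entry easy to locate, whereas silently skipping it would shrink the dataset without warning.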

RuntimeError: stack expects each tensor to be equal size

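Images have different dimensions (for example, 32x32 vs 64x64), so DataLoader can't stack them into a single batch. Resize all images to the same shape in <code>__getitem__</code> using <code>image.resize((target_w, target_h))</code>. Note that PIL takes the size as (width, height), the reverse of the (height, width) order in the resulting array.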
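A minimal sketch of the resize step follows, using blank in-memory images in place of files. `load_resized`, `TARGET_W`, and `TARGET_H` are illustrative names, and 32x32 is an arbitrary target size.

```python
import numpy as np
from PIL import Image

TARGET_W, TARGET_H = 32, 32  # hypothetical target size

def load_resized(image, target_w=TARGET_W, target_h=TARGET_H):
    """Convert to RGB and resize. PIL size tuples are (width, height)."""
    return image.convert('RGB').resize((target_w, target_h))

small = Image.new('RGB', (32, 32))
large = Image.new('RGB', (64, 48))  # width=64, height=48

# After resizing, both arrays share a shape and can be stacked into one batch.
arrays = [np.array(load_resized(img)) for img in (small, large)]
stacked = np.stack(arrays)
print(stacked.shape)  # (2, 32, 32, 3)

# A non-square target shows the axis order: width 20, height 10 -> array (10, 20, 3).
wide = np.array(load_resized(large, target_w=20, target_h=10))
print(wide.shape)     # (10, 20, 3)
```

Plain resizing distorts the aspect ratio; whether that is acceptable depends on the task.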

PermissionError when opening image

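Image file is locked or you lack read permissions. Verify the full path with <code>print(img_path)</code> and check that it is readable with <code>os.access(img_path, os.R_OK)</code> before <code>Image.open()</code>; <code>os.path.exists()</code> alone only tells you the file is there, not that you may read it.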

Experienced dev note

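A common mistake is decoding images in __init__, for example to pre-check that every file is readable. This defeats lazy loading: every image is read at dataset creation time, which consumes RAM and slows startup. (Image.open() by itself reads only the file header, but np.array(image) or image.load() decodes the whole file, and Image objects that are never closed keep file handles open.) Instead, check cheaply: call os.path.exists() on the paths at init, and don't open images until __getitem__. Also, when using num_workers > 0, keep your Dataset picklable (serializable). When workers are started with spawn (the default on Windows and macOS), the whole Dataset object is pickled and sent to each worker; with fork, each worker inherits a copy of the parent's memory instead. Either way, store only paths and labels in __init__ and do the file I/O in __getitem__, so the object never carries open file handles or decoded images.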
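The two points in this note can be sketched with the standard library alone. `find_missing_files` is a hypothetical helper, and plain dicts stand in for a Dataset's attributes to show what does and does not survive pickling.

```python
import os
import pickle
import tempfile

def find_missing_files(paths):
    """Return paths that are not existing files. No image is opened or decoded."""
    return [path for path in paths if not os.path.isfile(path)]

with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, 'cat_001.jpg')
    absent = os.path.join(tmp_dir, 'dog_001.jpg')
    with open(present, 'wb'):
        pass  # create an empty placeholder file
    missing = find_missing_files([present, absent])

    # State made of paths and labels round-trips through pickle.
    light_state = {'paths': [present], 'labels': [0]}
    paths_picklable = pickle.loads(pickle.dumps(light_state)) == light_state

    # State that holds an open file handle cannot be pickled.
    with open(present, 'rb') as handle:
        try:
            pickle.dumps({'paths': [present], 'handle': handle})
            handle_picklable = True
        except TypeError:
            handle_picklable = False

print(missing == [absent], paths_picklable, handle_picklable)  # True True False
```

A cheap existence check at init surfaces missing files early, while the expensive decoding stays in __getitem__.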

Check your understanding

You create a Dataset with 10,000 images. Your laptop has 8GB RAM. If you load all images into a list in __init__, what happens? Now explain how lazy loading (this pattern) solves the problem, and describe when a worker process reads a file from disk relative to when the main process calls .backward().

Answer hint

A correct answer mentions that pre-loading can exhaust RAM before training starts. Lazy loading reads files only in <code>__getitem__</code>, so only the batches currently being prepared or queued are in memory: roughly <code>batch_size * num_workers * prefetch_factor</code> images, where <code>prefetch_factor</code> defaults to 2. With <code>num_workers > 0</code>, workers read files while the main process is computing gradients, so I/O and computation overlap.
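A back-of-the-envelope calculation makes the difference concrete. The image size (224x224 RGB stored as float32) and the loader settings below are assumptions chosen for illustration; the exercise does not specify them. `prefetch_factor` is the DataLoader argument that sets how many batches each worker queues ahead, and it defaults to 2 when workers are enabled.

```python
# Assumed sample size: 224x224 RGB image stored as float32 (4 bytes per value).
height, width, channels, bytes_per_value = 224, 224, 3, 4
bytes_per_image = height * width * channels * bytes_per_value

# Eager loading: all 10,000 decoded images are held in a list at once.
num_images = 10_000
eager_gib = num_images * bytes_per_image / 2**30

# Lazy loading: only the batches in flight are held (assumed loader settings).
batch_size, num_workers, prefetch_factor = 32, 4, 2
images_in_flight = batch_size * num_workers * prefetch_factor
lazy_mib = images_in_flight * bytes_per_image / 2**20

print(f'Per image: {bytes_per_image} bytes')       # 602112 bytes
print(f'Eager: {eager_gib:.2f} GiB')               # 5.61 GiB
print(f'Lazy: {images_in_flight} images, {lazy_mib:.0f} MiB')  # 256 images, 147 MiB
```

Under these assumptions, eager loading needs most of an 8 GB machine before training even starts, while the lazy pattern stays in the low hundreds of megabytes.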

VERSION No breaking changes in Dataset or DataLoader between PyTorch 2.6.x and 2.11.x. This code is forward-compatible.

Next, learn how to use DataLoader's <code>collate_fn</code> to handle variable-sized samples and create custom batching logic beyond simple stacking.
