Code Intermediate medium · 5 min

TensorDataset: wrapping tensors

What you will learn

TensorDataset pairs aligned tensors together so DataLoader can batch them correctly during training.

Why this matters

In real training loops, you have separate X (features) and y (labels) tensors that must stay synchronized during batching. TensorDataset is the standard PyTorch way to enforce that alignment without manual indexing logic.

Skip if: Don't use TensorDataset when your data is already organized in a custom Dataset subclass (like loading images from disk with transforms), or when your tensors differ in length along dimension 0 (the sample dimension) and so cannot be paired index by index.

Explanation

TensorDataset is a lightweight wrapper that takes multiple aligned tensors and treats them as a single dataset. When you iterate over it (or pass it to DataLoader), it returns tuples of corresponding elements from each tensor, maintaining index alignment automatically.

Mechanically, TensorDataset stores references to your input tensors and implements __getitem__ to return the tuple (tensors[0][i], tensors[1][i], ...) for index i. It doesn't copy data: it just provides a consistent indexing interface. DataLoader then uses this interface to build batches: it draws a list of indices, fetches each sample through __getitem__, and stacks the results along a new dimension 0.
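
As a quick check of the indexing behavior described above, the following minimal sketch compares dataset[i] with direct indexing and confirms that a sample shares storage with the source tensor. The tensor names and sizes are illustrative assumptions.

```python
import torch
from torch.utils.data import TensorDataset

# Illustrative tensors (names and shapes are assumptions for this sketch)
X = torch.arange(12, dtype=torch.float32).reshape(4, 3)  # 4 samples, 3 features
y = torch.tensor([0, 1, 0, 1])                            # 4 labels

dataset = TensorDataset(X, y)

# dataset[i] is the tuple (X[i], y[i])
x2, y2 = dataset[2]
same_features = torch.equal(x2, X[2])
same_label = torch.equal(y2, y[2])

# The sample is a view into X, not a copy: both start at the same address
shares_memory = x2.data_ptr() == X[2].data_ptr()

print(same_features, same_label, shares_memory)
```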

Use TensorDataset when you have preprocessed data already in memory as tensors (like a train/test split you computed locally), and you need batching without the overhead of a custom Dataset class. It's the bridge between raw tensors and the DataLoader pipeline.

Analogy

Think of TensorDataset as a file cabinet with labeled drawers (X drawer, y drawer). Each drawer holds a stack of index cards. TensorDataset makes sure that when you pull card #42 from the X drawer, you always get the corresponding card #42 from the y drawer: no mixing.

Code

python

import torch
from torch.utils.data import TensorDataset, DataLoader

# Create sample feature and label tensors
X = torch.randn(10, 3)  # 10 samples, 3 features
y = torch.randint(0, 2, (10,))  # 10 binary labels

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"X[0]: {X[0]}")
print(f"y[0]: {y[0]}")

# Wrap tensors in TensorDataset
dataset = TensorDataset(X, y)

print(f"\nDataset length: {len(dataset)}")
print(f"First sample from dataset: {dataset[0]}")
print(f"Type: {type(dataset[0])}")

# Use with DataLoader for batching
batch_loader = DataLoader(dataset, batch_size=3, shuffle=True)

print(f"\nFirst batch from DataLoader:")
for batch_X, batch_y in batch_loader:
    print(f"Batch X shape: {batch_X.shape}")
    print(f"Batch y shape: {batch_y.shape}")
    print(f"Batch X:\n{batch_X}")
    print(f"Batch y: {batch_y}")
    break  # Only show first batch

Output

X shape: torch.Size([10, 3])
y shape: torch.Size([10])
X[0]: tensor([ 0.1234, -0.5678,  0.9012])
y[0]: tensor(1)

Dataset length: 10
First sample from dataset: (tensor([ 0.1234, -0.5678,  0.9012]), tensor(1))
Type: <class 'tuple'>

First batch from DataLoader:
Batch X shape: torch.Size([3, 3])
Batch y shape: torch.Size([3])
Batch X:
tensor([[ 0.2456, -0.3421,  0.6789],
        [-0.1234,  0.4567, -0.8901],
        [ 0.5678, -0.2345,  0.1234]])
Batch y: tensor([0, 1, 1])

What just happened?

We created two tensors (X with shape [10, 3] and y with shape [10]). TensorDataset wrapped them together, providing indexed access via dataset[i], which returns the tuple (X[i], y[i]). When we passed it to DataLoader with batch_size=3 and shuffle=True, DataLoader drew 3 random indices, fetched the matching (X[i], y[i]) pairs, and stacked them into aligned batches: 3 samples from X and their 3 corresponding labels from y.
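
To make the batching step concrete, here is a rough sketch of what assembling one batch by hand could look like. It is a simplification of DataLoader's default behavior, not its actual implementation, and the index list is an arbitrary assumption standing in for what a sampler would produce.

```python
import torch
from torch.utils.data import TensorDataset

X = torch.randn(10, 3)
y = torch.randint(0, 2, (10,))
dataset = TensorDataset(X, y)

# Indices a sampler might hand out for one batch (chosen arbitrarily here)
indices = [4, 0, 7]

# Fetch each sample through __getitem__, then stack per field
samples = [dataset[i] for i in indices]
batch_X = torch.stack([features for features, _ in samples])
batch_y = torch.stack([label for _, label in samples])

print(batch_X.shape)  # torch.Size([3, 3])
print(batch_y.shape)  # torch.Size([3])
```

Row k of batch_X and element k of batch_y both come from the same index, which is why features and labels stay paired even when the order is shuffled.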

Common gotcha

TensorDataset does check that all tensors have the same length along dimension 0, but only with an assert in its constructor. If you accidentally wrap a (10, 3) tensor with a (9,) tensor, you get AssertionError: Size mismatch between tensors, which does not say which tensor is wrong, and the check is skipped when Python runs with the -O flag. Check len(X) == len(y) yourself before wrapping so the failure is explicit and easy to trace.
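
One way to make the length check explicit is a small wrapper that compares first-dimension sizes before constructing the dataset. This is a sketch only: make_dataset is a hypothetical helper name, not part of PyTorch.

```python
import torch
from torch.utils.data import TensorDataset

def make_dataset(*tensors):
    """Hypothetical helper: validate first-dimension lengths, then wrap."""
    lengths = [t.shape[0] for t in tensors]
    if len(set(lengths)) != 1:
        raise ValueError(f"First-dimension sizes differ: {lengths}")
    return TensorDataset(*tensors)

X = torch.randn(10, 3)
y_good = torch.randint(0, 2, (10,))
y_bad = torch.randint(0, 2, (9,))  # one label short

dataset = make_dataset(X, y_good)  # lengths match, so this succeeds

message = None
try:
    make_dataset(X, y_bad)
except ValueError as err:
    message = str(err)
    print(message)
```

Raising ValueError with the offending lengths in the message points directly at the mismatched tensor, instead of leaving it to be found during training.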

Error recovery

IndexError: index 5 is out of bounds for dimension 0 with size 5

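An index greater than or equal to len(dataset) reached the dataset, typically from a custom sampler or a hand-built index list sized for a different dataset (here, index 5 against 5 samples). Mismatched tensor lengths normally do not get this far, because TensorDataset rejects them at construction. Check that every index falls in range(len(dataset)), and verify lengths up front: assert X.shape[0] == y.shape[0]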

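RuntimeError: stack expects each tensor to be equal size, but got [3] at entry 0 and [2] at entry 1

Something is trying to stack samples with incompatible shapes (e.g., variable-length sequences). With TensorDataset this usually surfaces before wrapping, when you call torch.stack to build the tensor in the first place, because every row of an existing tensor already has the same shape. It can also come from DataLoader's default collation on a custom Dataset. Either pad your sequences to a uniform size before wrapping them in TensorDataset, or use a custom Dataset with a custom collate_fn.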
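
As a sketch of the padding route, the example below pads variable-length sequences with torch.nn.utils.rnn.pad_sequence and keeps the original lengths as a third tensor. The sample data and the choice of 0 as the padding value are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import TensorDataset, DataLoader

# Variable-length sequences (illustrative data)
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
labels = torch.tensor([0, 1, 0])

# Keep the true lengths so padding can be masked out later
lengths = torch.tensor([len(s) for s in sequences])

# Pad with zeros up to the longest sequence -> shape (3, 3)
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

dataset = TensorDataset(padded, lengths, labels)
loader = DataLoader(dataset, batch_size=2)

for batch_seq, batch_len, batch_label in loader:
    print(batch_seq.shape, batch_len.tolist(), batch_label.tolist())
```

Storing lengths alongside the padded tensor lets downstream code distinguish real tokens from padding. TensorDataset accepts any number of tensors as long as they agree on the first dimension.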

TypeError: 'TensorDataset' object does not support item assignment

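TensorDataset does not implement __setitem__, so dataset[i] = ... fails. Modify the underlying tensors in place instead (the dataset holds references, so it sees the change), or build a new TensorDataset from the updated tensors.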

Experienced dev note

TensorDataset is deceptively simple, but it is the right fit for most supervised learning setups where the data already sits in memory as tensors. Don't over-engineer with custom Dataset classes until you actually need transforms or lazy loading. Also, TensorDataset copies nothing: it just holds references. If you modify your source tensors in place after wrapping, the dataset sees the changes (rebinding the variable to a new tensor does not affect it). This is usually fine, but if you're doing cross-validation and reusing tensors, clone them explicitly before wrapping. Note that detach() alone is not enough, since a detached tensor still shares storage with the original.
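
The reference-holding behavior can be seen in a small sketch that writes to the source tensor after wrapping. The names shared and isolated are illustrative assumptions.

```python
import torch
from torch.utils.data import TensorDataset

X = torch.zeros(4, 2)
y = torch.tensor([0, 1, 0, 1])

shared = TensorDataset(X, y)                    # holds references to X and y
isolated = TensorDataset(X.clone(), y.clone())  # holds independent copies

# In-place write to the source tensor after wrapping
X[0] = 1.0

shared_row = shared[0][0]      # reflects the write
isolated_row = isolated[0][0]  # unchanged

print(shared_row)    # tensor([1., 1.])
print(isolated_row)  # tensor([0., 0.])
```

Cloning costs extra memory, so it is worth doing only when independent copies are actually required, for example when one fold's preprocessing must not leak into another.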

Check your understanding

You have X (1000, 50) and y (1000,). You create dataset = TensorDataset(X, y) and loader = DataLoader(dataset, batch_size=32). On each iteration of loader, how many tensors do you receive, and what are their exact shapes? How many batches does one full pass produce?

Answer hint

The answer requires understanding that DataLoader yields tuples (one element per input tensor to TensorDataset), and that batch_size applies to dimension 0. Calculate the number of complete batches and the final batch size if 1000 is not divisible by 32.

VERSION TensorDataset has been stable since PyTorch 0.4.0 (2018). No breaking changes in PyTorch 2.11.x. However, ensure you're using torch.utils.data.DataLoader (not the deprecated torch.utils.data.DataLoaderIter) with modern sampling strategies like DistributedSampler for multi-GPU training.

Next, explore custom Dataset subclasses when TensorDataset isn't flexible enough: for example, when you need on-the-fly image transforms or lazy loading from disk.
