TensorDataset: wrapping tensors
Why this matters
In real training loops, you have separate X (features) and y (labels) tensors that must stay synchronized during batching. TensorDataset is the standard PyTorch way to enforce that alignment without manual indexing logic.
Explanation
TensorDataset is a lightweight wrapper that takes multiple aligned tensors and treats them as a single dataset. When you iterate over it (or pass it to DataLoader), it returns tuples of corresponding elements from each tensor, maintaining index alignment automatically.
Mechanically, TensorDataset stores references to your input tensors and implements __getitem__ to return the tuple (tensors[0][i], tensors[1][i], ...) for index i. It doesn't copy data: it just provides a consistent indexing interface. DataLoader then uses this interface to build batches: it fetches the samples for a batch of indices and, with the default collate function, stacks them along a new dimension 0.
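To make the indexing behavior concrete, here is a minimal sketch of a class that behaves like the description above. MiniTensorDataset is a hypothetical name used only for illustration; it is not the library's source code, just a stand-in that follows the same idea.

```python
import torch
from torch.utils.data import Dataset


class MiniTensorDataset(Dataset):
    """Hypothetical stand-in for TensorDataset, for illustration only."""

    def __init__(self, *tensors):
        # Keep references to the caller's tensors; nothing is copied.
        self.tensors = tensors

    def __getitem__(self, i):
        # One element per wrapped tensor, all taken at the same index.
        return tuple(t[i] for t in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)


X = torch.arange(12.0).reshape(4, 3)  # 4 samples, 3 features
y = torch.tensor([0, 1, 0, 1])        # 4 labels

ds = MiniTensorDataset(X, y)
sample_X, sample_y = ds[2]
print(len(ds))             # number of samples
print(sample_X, sample_y)  # row 2 of X and element 2 of y
print(ds.tensors[0] is X)  # the wrapper holds the same tensor object
```

The last line is the point of the sketch: the wrapper stores the original tensor object, so indexing the dataset reads straight from your data.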
Use TensorDataset when you have preprocessed data already in memory as tensors (like a train/test split you computed locally), and you need batching without the overhead of a custom Dataset class. It's the bridge between raw tensors and the DataLoader pipeline.
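As a rough sketch of that workflow, the snippet below splits in-memory tensors into train and test sets and wraps each one. The 80/20 ratio, the tensor sizes, and the batch size are assumptions chosen for the example.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)  # reproducible permutation for the example

X = torch.randn(100, 5)          # assumed: 100 samples, 5 features
y = torch.randint(0, 2, (100,))  # assumed: binary labels

# One shared permutation keeps X and y aligned through the split.
perm = torch.randperm(len(X))
train_idx, test_idx = perm[:80], perm[80:]

train_ds = TensorDataset(X[train_idx], y[train_idx])
test_ds = TensorDataset(X[test_idx], y[test_idx])

train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=16)

print(len(train_ds), len(test_ds))          # samples per split
print(len(train_loader), len(test_loader))  # batches per split
```

Indexing X and y with the same index tensor is what preserves the pairing; TensorDataset then carries that pairing into batching.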
Analogy
Think of TensorDataset as a file cabinet with labeled drawers (X drawer, y drawer). Each drawer holds a stack of index cards. TensorDataset makes sure that when you pull card #42 from the X drawer, you always get the corresponding card #42 from the y drawer: no mixing.
Code
import torch
from torch.utils.data import TensorDataset, DataLoader
# Create sample feature and label tensors
X = torch.randn(10, 3) # 10 samples, 3 features
y = torch.randint(0, 2, (10,)) # 10 binary labels
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"X[0]: {X[0]}")
print(f"y[0]: {y[0]}")
# Wrap tensors in TensorDataset
dataset = TensorDataset(X, y)
print(f"\nDataset length: {len(dataset)}")
print(f"First sample from dataset: {dataset[0]}")
print(f"Type: {type(dataset[0])}")
# Use with DataLoader for batching
batch_loader = DataLoader(dataset, batch_size=3, shuffle=True)
print(f"\nFirst batch from DataLoader:")
for batch_X, batch_y in batch_loader:
    print(f"Batch X shape: {batch_X.shape}")
    print(f"Batch y shape: {batch_y.shape}")
    print(f"Batch X:\n{batch_X}")
    print(f"Batch y: {batch_y}")
    break  # Only show first batch
Output
X shape: torch.Size([10, 3])
y shape: torch.Size([10])
X[0]: tensor([ 0.1234, -0.5678, 0.9012])
y[0]: tensor(1)
Dataset length: 10
First sample from dataset: (tensor([ 0.1234, -0.5678, 0.9012]), tensor(1))
Type: <class 'tuple'>
First batch from DataLoader:
Batch X shape: torch.Size([3, 3])
Batch y shape: torch.Size([3])
Batch X:
tensor([[ 0.2456, -0.3421, 0.6789],
[-0.1234, 0.4567, -0.8901],
[ 0.5678, -0.2345, 0.1234]])
Batch y: tensor([0, 1, 1])
What just happened?
We created two tensors (X with shape [10, 3] and y with shape [10]). TensorDataset wrapped them together, providing indexed access via dataset[i], which returns the tuple (X[i], y[i]). When we passed it to DataLoader with batch_size=3, DataLoader drew 3 shuffled indices, fetched the matching (X[i], y[i]) pairs, and stacked them into aligned batches: 3 samples from X and their 3 corresponding labels from y. Because shuffle=True, the first batch is generally not rows 0 to 2, which is why its values differ from X[0].
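One way to picture what DataLoader does with its default settings is to rebuild a batch by hand. The sketch below turns shuffling off so the first batch covers indices 0, 1, 2, then compares it with samples fetched one at a time and stacked. It illustrates the observable behavior, not DataLoader's internal code path.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(10, 3)
y = torch.randint(0, 2, (10,))
dataset = TensorDataset(X, y)

# shuffle=False so the first batch covers indices 0, 1, 2 in order.
loader = DataLoader(dataset, batch_size=3, shuffle=False)
batch_X, batch_y = next(iter(loader))

# Rebuild the same batch by hand: fetch each sample, then stack per field.
samples = [dataset[i] for i in range(3)]
manual_X = torch.stack([s[0] for s in samples])
manual_y = torch.stack([s[1] for s in samples])

print(torch.equal(batch_X, manual_X))  # True
print(torch.equal(batch_y, manual_y))  # True
```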
Common gotcha
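TensorDataset only validates one thing: that all tensors have the same length along dimension 0. If you accidentally wrap a (10, 3) tensor with a (9,) tensor, the constructor fails immediately with AssertionError: Size mismatch between tensors. Because the check is a plain assert, it is skipped when Python runs with the -O flag, so an explicit len(X) == len(y) check before wrapping is still worthwhile. What TensorDataset cannot detect is misalignment between tensors of equal length: if you shuffle or filter X without applying the same operation to y, training proceeds silently on mismatched pairs.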
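An explicit length check before wrapping gives a descriptive error at the point where the data is assembled. The helper below is a minimal sketch; make_dataset is a hypothetical name, not part of PyTorch.

```python
import torch
from torch.utils.data import TensorDataset


def make_dataset(X, y):
    """Hypothetical helper: verify lengths, then wrap."""
    if len(X) != len(y):
        raise ValueError(
            f"Length mismatch: X has {len(X)} samples, y has {len(y)} labels"
        )
    return TensorDataset(X, y)


X = torch.randn(10, 3)
y_good = torch.randint(0, 2, (10,))
y_bad = torch.randint(0, 2, (9,))  # one label short, on purpose

dataset = make_dataset(X, y_good)
print(len(dataset))

try:
    make_dataset(X, y_bad)
except ValueError as err:
    message = str(err)
    print(message)
```

Raising ValueError with both lengths in the message makes the failure easier to trace than an index error surfacing later in a training loop.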
Error recovery
IndexError: index 5 is out of bounds for dimension 0 with size 5
RuntimeError: stack expected each tensor to be equal size, but got sizes [3, 2]
TypeError: 'TensorDataset' object does not support item assignment
Experienced dev note
TensorDataset is deceptively simple, but it's the right tool for most supervised learning pipelines where the data already fits in memory as tensors. Don't over-engineer with custom Dataset classes until you actually need transforms or lazy loading. Also, TensorDataset copies nothing: it just holds references. If you modify your source tensors in place after wrapping, the dataset sees the changes. This is usually fine, but if you're doing cross-validation and reusing tensors, clone them explicitly before wrapping (detaching alone does not copy the data).
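The reference-holding behavior is easy to observe directly. In this small sketch, one dataset wraps the source tensors as-is and another wraps clones; the tensor values are arbitrary.

```python
import torch
from torch.utils.data import TensorDataset

X = torch.zeros(4, 2)
y = torch.tensor([0, 1, 0, 1])

shared = TensorDataset(X, y)                    # holds references to X and y
isolated = TensorDataset(X.clone(), y.clone())  # holds independent copies

X[0] = 99.0  # in-place edit of the source tensor after wrapping

print(shared[0][0])    # reflects the edit
print(isolated[0][0])  # still the original zeros
```

Only in-place edits show up in the dataset. Rebinding the name, as in X = X + 1, creates a new tensor and leaves the wrapped one untouched.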
Check your understanding
You have X (1000, 50) and y (1000,). You create dataset = TensorDataset(X, y) and loader = DataLoader(dataset, batch_size=32). When you iterate over loader once, how many tensors do you receive, and what are their exact shapes?
Answer hint
The answer requires understanding that DataLoader yields tuples (one element per input tensor to TensorDataset), and that batch_size applies to dimension 0. Calculate the number of complete batches and the final batch size if 1000 is not divisible by 32.
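Once you have worked out an answer, a short script like the following is one way to check it empirically. The label values are an assumption, since the exercise only specifies the shapes.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(1000, 50)
y = torch.randint(0, 2, (1000,))  # assumed label values; only the shape matters

loader = DataLoader(TensorDataset(X, y), batch_size=32)

# Record the shapes of both tensors in every batch of one full pass.
shapes = [(tuple(bx.shape), tuple(by.shape)) for bx, by in loader]

print(len(shapes))  # number of batches in one pass
print(shapes[0])    # shapes in the first batch
print(shapes[-1])   # shapes in the last batch
```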