Train/validation split pattern
Why this matters
Training loss alone can mislead you: a model can fit the training data almost perfectly and still fail on unseen data. A proper train/validation split lets you catch overfitting as it happens and tune hyperparameters against data the model has never trained on. Without it, you're flying blind.
Explanation
What it is: Dividing your dataset into two disjoint subsets, one for training the model and one for validating it, to measure how well the model generalizes to unseen data. The model never trains on validation data.
How it works mechanically: You use PyTorch's torch.utils.data.random_split() to partition a dataset and create a separate DataLoader for each subset. During the validation loop, you call model.eval() to disable dropout and switch batch norm to its running statistics, then wrap the loop in torch.no_grad() to skip gradient computation. After each epoch, you compute the loss on the validation set without backprop. If validation loss stops decreasing while training loss keeps dropping, your model is overfitting.
When to use it: Always during development. Every serious model needs this split. The typical breakdown is 80/20 or 70/30 training/validation, but adjust based on your dataset size and domain.
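The overfitting signal described above (validation loss stalling while training loss keeps dropping) can be checked programmatically once you record per-epoch losses. The sketch below is one possible heuristic, not a standard API: the helper names, the patience value, and the loss histories are all made up for illustration.

```python
def epochs_since_best(val_losses):
    """Return how many epochs have passed since the lowest validation loss."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch


def looks_overfit(train_losses, val_losses, patience=3):
    """Heuristic (hypothetical helper): validation loss has not improved for
    `patience` epochs while training loss is still going down."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    val_stalled = epochs_since_best(val_losses) >= patience
    train_improving = train_losses[-1] < train_losses[-1 - patience]
    return val_stalled and train_improving


# Made-up loss curves, for illustration only.
train_hist = [0.70, 0.55, 0.42, 0.33, 0.26, 0.20]
val_hist = [0.71, 0.60, 0.56, 0.57, 0.59, 0.62]
print(looks_overfit(train_hist, val_hist))
```

In a real training loop you would append avg_train_loss and avg_val_loss to these lists after each epoch and stop, or lower the learning rate, when the check fires. The right patience depends on how noisy your validation loss is.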
Analogy
Training is like studying practice problems with an answer key. Validation is your friend giving you similar problems you've never seen before: their feedback tells you if you actually understand the material or just memorized the answers.
Code
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
torch.manual_seed(42)
X = torch.randn(200, 10)
y = torch.randint(0, 2, (200,))
dataset = TensorDataset(X, y)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(32, 2),
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
epochs = 5
for epoch in range(epochs):
    # Training phase: dropout active, gradients tracked.
    model.train()
    train_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Validation phase: dropout disabled, no gradient tracking.
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for X_batch, y_batch in val_loader:
            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            val_loss += loss.item()

    avg_train_loss = train_loss / len(train_loader)
    avg_val_loss = val_loss / len(val_loader)
    print(f"Epoch {epoch + 1}: train_loss={avg_train_loss:.4f}, val_loss={avg_val_loss:.4f}")

Output
Epoch 1: train_loss=0.7108, val_loss=0.7125
Epoch 2: train_loss=0.6342, val_loss=0.6621
Epoch 3: train_loss=0.5389, val_loss=0.6024
Epoch 4: train_loss=0.4503, val_loss=0.5667
Epoch 5: train_loss=0.3681, val_loss=0.5501
What just happened?
The code created a synthetic dataset of 200 samples and split it 80/20 into training (160) and validation (40) subsets. It then trained a small two-layer neural network with dropout for 5 epochs. In each epoch, the model runs in train() mode during the training loop (dropout active; batch norm layers, if the model had any, would update their running stats) and switches to eval() mode for validation (dropout disabled; batch norm would use its running stats). The torch.no_grad() context manager skipped gradient tracking on validation batches, which reduces memory use and speeds up evaluation. Training loss decreased from 0.71 to 0.37, while validation loss decreased more slowly, from 0.71 to 0.55. The widening gap shows mild overfitting but no catastrophic divergence.
Common gotcha
Forgetting to call model.eval() before validation. If you skip this, dropout layers stay active during validation (dropping ~30% of activations randomly), batch norm uses batch statistics instead of running stats, and your validation loss becomes unreliable and noisy. Your validation curves will be erratic and you'll think the model is unstable when it's actually fine. Also forgetting torch.no_grad() wastes GPU memory building computational graphs you'll never use for backprop.
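To see the gotcha in isolation, the sketch below passes the same input through a dropout layer in both modes and compares a forward pass with and without torch.no_grad(). The tensor shapes and seed are arbitrary choices for illustration, and the exact fraction of zeroed activations varies with the seed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Dropout(p=0.3)
x = torch.ones(1, 1000)

# Train mode: roughly 30% of activations are zeroed, the rest are scaled by 1 / (1 - p).
layer.train()
train_out = layer(x)
zeroed_fraction = (train_out == 0).float().mean().item()

# Eval mode: dropout is the identity, so the output equals the input.
layer.eval()
eval_out = layer(x)
same_in_eval = torch.equal(eval_out, x)

# no_grad: the output is not attached to an autograd graph.
lin = nn.Linear(4, 2)
inp = torch.randn(3, 4)
with_grad = lin(inp)
with torch.no_grad():
    without_grad = lin(inp)

print(f"zeroed in train mode: {zeroed_fraction:.2f}")
print(f"eval output equals input: {same_in_eval}")
print(f"requires_grad with/without no_grad: {with_grad.requires_grad}/{without_grad.requires_grad}")
```

Note that model.eval() and torch.no_grad() are independent switches: the first changes layer behavior, the second only turns off gradient tracking. Validation needs both.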
Error recovery
RuntimeError: Expected all tensors to be on the same device
ValueError: num_samples should be a positive integer
loss is NaN or Inf
Experienced dev note
In production, the validation set size and split ratio matter more than you think. A 70/30 split works fine for 1M samples, but with only 500 samples it leaves you 150 validation samples, which is borderline too small to trust. Also, avoid shuffle=True in your validation loader; it doesn't affect correctness, but it makes debugging harder when you need to trace which exact sample caused a bad prediction. And if you have severe class imbalance, use a stratified split, for example scikit-learn's train_test_split with its stratify argument (from sklearn.model_selection import train_test_split), so that train and val share the same class distribution. PyTorch's random_split() does not stratify.
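One way to get a stratified split for a PyTorch dataset is to stratify the indices with scikit-learn and wrap them in torch.utils.data.Subset. This is a minimal sketch, assuming scikit-learn is installed and the labels fit in memory; the imbalanced toy data (180 negatives, 20 positives) is made up for illustration.

```python
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

torch.manual_seed(42)

# Imbalanced toy data (made up): 180 samples of class 0, 20 of class 1.
X = torch.randn(200, 10)
y = torch.cat([torch.zeros(180, dtype=torch.long), torch.ones(20, dtype=torch.long)])
dataset = TensorDataset(X, y)

# Split the indices, not the tensors, so the original dataset stays intact.
indices = list(range(len(dataset)))
train_idx, val_idx = train_test_split(
    indices,
    test_size=0.2,
    stratify=y.numpy(),  # keep the class ratio the same in both subsets
    random_state=42,
)

train_dataset = Subset(dataset, train_idx)
val_dataset = Subset(dataset, val_idx)

# Fraction of positive labels in each subset.
train_pos = y[train_idx].float().mean().item()
val_pos = y[val_idx].float().mean().item()
print(f"positive rate: train={train_pos:.2f}, val={val_pos:.2f}")
```

With these toy labels, both subsets should end up with the same 10% positive rate. A plain random_split() on 40 validation samples could easily draw only one or two positives, or none.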
Check your understanding
If your training loss decreases steadily but validation loss becomes erratic and noisy, what is the most likely cause and how would you diagnose it?
Show answer hint
Erratic validation loss alongside a smooth training loss suggests that dropout or batch norm are still in training mode during validation, meaning model.eval() was not called. If the training loss is the erratic one, suspect a learning rate that is too high. To diagnose, verify that model.eval() is called before the validation loop and check that the validation loader uses a reasonable batch size.
Older PyTorch code (before 0.4) used the volatile=True flag for inference. Use torch.no_grad() instead, which is the modern pattern. Also, torch.utils.data.random_split() has been stable since 1.0.0 with no breaking changes through 2.11.x.