
Train/validation split pattern

What you will learn

Split your dataset into training and validation sets, and switch dropout and batch norm between their training and evaluation behavior, so you can detect overfitting and get honest performance estimates.

Why this matters

Training loss alone lies: a model can fit the training data almost perfectly and still fail on unseen data. A proper train/validation split lets you catch overfitting while training is still running and tune hyperparameters against data the model never trained on. Without it, you're flying blind.

Skip if: you're doing quick prototyping or exploratory work on a tiny dataset just to verify the code runs; you already have a separate, reserved held-out set that is large enough to serve as validation data; or the dataset is so small that stratified splitting becomes impossible (but honestly, you should still split).

Explanation

What it is: Dividing your dataset into two disjoint subsets, one for training the model and one for validating it, to measure how well the model generalizes to unseen data. The model never trains on validation data.

How it works mechanically: You use PyTorch's torch.utils.data.random_split() to partition a dataset and create a separate DataLoader for each part. In the validation loop you call model.eval() so that dropout is disabled and batch norm uses its running statistics instead of updating them, and you wrap the loop in torch.no_grad() to skip gradient tracking. After each epoch, you compute loss on the validation set without backprop. If validation loss stops decreasing (or starts rising) while training loss keeps dropping, your model is overfitting.
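The validation step described above can be pulled into a small helper. This is a minimal sketch, not the only way to do it: the function name evaluate and the toy model and data are made up for illustration. It weights each batch loss by its batch size, so a smaller final batch does not skew the average.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, criterion):
    """Return the mean per-sample loss over a loader, without tracking gradients."""
    was_training = model.training      # remember the mode so it can be restored
    model.eval()                       # dropout off, batch norm uses running stats
    total_loss, total_samples = 0.0, 0
    with torch.no_grad():              # no autograd graph is built in this block
        for X_batch, y_batch in loader:
            loss = criterion(model(X_batch), y_batch)
            # criterion returns a batch mean, so weight it by the batch size
            total_loss += loss.item() * X_batch.size(0)
            total_samples += X_batch.size(0)
    model.train(was_training)          # put the model back in its previous mode
    return total_loss / total_samples


# Toy data and model, invented for this example only
torch.manual_seed(0)
toy_loader = DataLoader(
    TensorDataset(torch.randn(40, 10), torch.randint(0, 2, (40,))),
    batch_size=16,
)
toy_model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(), nn.Dropout(0.3), nn.Linear(32, 2)
)

val_loss = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss())
print(f"val_loss={val_loss:.4f}")
```

Restoring the previous mode at the end means the helper can be called in the middle of a training loop without leaving the model stuck in eval mode.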

When to use it: Always during development. Every serious model needs this split. The typical breakdown is 80/20 or 70/30 training/validation, but adjust based on your dataset size and domain.
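If you want the same 80/20 split every time you run an experiment, one option is to give random_split() its own seeded generator. The sketch below assumes a toy dataset of 100 items, and the helper name split_80_20 is hypothetical.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset: 100 items whose feature is simply their index
data = TensorDataset(torch.arange(100).float().unsqueeze(1), torch.zeros(100))


def split_80_20(dataset, seed):
    """Split a dataset 80/20 using a dedicated, seeded generator."""
    n_train = int(0.8 * len(dataset))
    n_val = len(dataset) - n_train          # remainder, so the sizes always sum up
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val], generator=gen)


train_a, val_a = split_80_20(data, seed=42)
train_b, val_b = split_80_20(data, seed=42)

print(len(train_a), len(val_a))                          # 80 20
print(val_a.indices == val_b.indices)                    # True: same seed, same split
print(set(train_a.indices).isdisjoint(val_a.indices))    # True: no shared samples
```

A dedicated generator keeps the split stable even if other code consumes random numbers from the global seed before the split happens.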

Analogy

Training is like studying practice problems with an answer key. Validation is your friend giving you similar problems you've never seen before: their feedback tells you if you actually understand the material or just memorized the answers.

Code

python

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split

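torch.manual_seed(42)  # fix the seed so data, split, and weight init are repeatable

# Synthetic data: 200 samples, 10 features, random binary labels
X = torch.randn(200, 10)
y = torch.randint(0, 2, (200,))

# 80/20 split; taking val_size as the remainder guarantees the sizes sum to len(dataset)
dataset = TensorDataset(X, y)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Shuffle training batches every epoch; keep validation order fixed
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)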

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(32, 2)
)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 5
for epoch in range(epochs):
    # Training phase: dropout is active and gradients are tracked
    model.train()
    train_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()              # clear gradients from the previous step
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        loss.backward()                    # backpropagate
        optimizer.step()                   # update the weights
        train_loss += loss.item()

    # Validation phase: dropout is disabled and no gradients are tracked
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for X_batch, y_batch in val_loader:
            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            val_loss += loss.item()

    # Average the per-batch mean losses over the number of batches
    avg_train_loss = train_loss / len(train_loader)
    avg_val_loss = val_loss / len(val_loader)
    print(f"Epoch {epoch + 1}: train_loss={avg_train_loss:.4f}, val_loss={avg_val_loss:.4f}")

Output

Epoch 1: train_loss=0.7108, val_loss=0.7125
Epoch 2: train_loss=0.6342, val_loss=0.6621
Epoch 3: train_loss=0.5389, val_loss=0.6024
Epoch 4: train_loss=0.4503, val_loss=0.5667
Epoch 5: train_loss=0.3681, val_loss=0.5501

What just happened?

The code created a synthetic dataset of 200 samples and split it 80/20 into train (160) and validation (40) subsets. It then trained a small 2-layer neural network with dropout for 5 epochs. Each epoch, the model runs in <code>train()</code> mode during the train loop (dropout active; batch norm layers, if the model had any, would update their running stats) and switches to <code>eval()</code> mode during validation (dropout disabled; batch norm would use its running stats). The <code>torch.no_grad()</code> context manager skipped gradient computation on validation batches, reducing memory use and speeding up evaluation. In the output shown, training loss decreased from 0.71 to 0.37 while validation loss decreased more slowly from 0.71 to 0.55; a widening gap like this is the signature of mild overfitting, without catastrophic divergence. Treat the numbers as illustrative: the labels here are random and independent of the inputs, so in your own run expect validation loss to stay close to ln 2 ≈ 0.69 and the exact values to differ.

Common gotcha

Forgetting to call model.eval() before validation. If you skip this, dropout layers stay active during validation (dropping ~30% of activations randomly), batch norm uses batch statistics instead of running stats, and your validation loss becomes unreliable and noisy. Your validation curves will be erratic and you'll think the model is unstable when it's actually fine. Also forgetting torch.no_grad() wastes GPU memory building computational graphs you'll never use for backprop.
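You can see both effects in isolation with a few lines. This is a small sketch using a standalone dropout layer and a linear layer chosen only for the demonstration; it is not taken from the training script above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.3)
x = torch.ones(1, 1000)

layer.train()
out_train = layer(x)   # about 30% of entries zeroed, survivors scaled by 1 / 0.7
layer.eval()
out_eval = layer(x)    # dropout is a no-op in eval mode

zeroed = (out_train == 0).float().mean().item()
print(f"train mode: {zeroed:.0%} of activations zeroed")
print(f"eval mode leaves input unchanged: {torch.equal(out_eval, x)}")

# torch.no_grad() is a separate switch: it controls whether a graph is recorded
lin = nn.Linear(4, 2)
with torch.no_grad():
    no_graph = lin(torch.randn(3, 4))
with_graph = lin(torch.randn(3, 4))
print(no_graph.requires_grad, with_graph.requires_grad)  # False True
```

The two switches are independent: model.eval() changes layer behavior, while torch.no_grad() only stops autograd from recording operations, so validation needs both.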

Error recovery

RuntimeError: Expected all tensors to be on the same device

Your model is on GPU but your batch is on CPU (or vice versa). Set <code>device = 'cuda' if torch.cuda.is_available() else 'cpu'</code> and call <code>model = model.to(device)</code> once before the training loop, then move each batch with <code>X_batch, y_batch = X_batch.to(device), y_batch.to(device)</code> inside the loop.
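A minimal sketch of the device handling, assuming a toy linear model and random data invented for the example. The model is moved once, and each batch is moved inside the loop.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)   # move the model once, before any loop
loader = DataLoader(
    TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,))),
    batch_size=16,
)
criterion = nn.CrossEntropyLoss()

model.eval()
with torch.no_grad():
    for X_batch, y_batch in loader:
        # Batches come off the DataLoader on the CPU; move them every iteration
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        loss = criterion(model(X_batch), y_batch)

print(loss.device.type == device.type)  # True: model and data share a device
```

The same two `.to(device)` lines belong in the training loop as well as the validation loop.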

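ValueError: Sum of input lengths does not equal the length of the input dataset!

Your <code>random_split()</code> sizes don't add up to the dataset length. If the dataset has 200 items and you split [160, 41], it fails. Verify <code>train_size + val_size == len(dataset)</code>. A related error, <code>ValueError: num_samples should be a positive integer value</code>, comes from the DataLoader's random sampler when a split ends up empty and <code>shuffle=True</code>.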
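One way to avoid the mismatch is to compute the validation size as the remainder instead of rounding both sizes separately. The sketch below uses a made-up dataset of 203 items (a length that does not divide evenly) and also triggers the failure on purpose so you can see it.

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(203, 10), torch.randint(0, 2, (203,)))

train_size = int(0.8 * len(dataset))    # 162
val_size = len(dataset) - train_size    # 41, the remainder
assert train_size + val_size == len(dataset)
train_ds, val_ds = random_split(dataset, [train_size, val_size])
print(len(train_ds), len(val_ds))       # 162 41

# Hand-typed sizes that do not sum to 203 are rejected
mismatch_error = None
try:
    random_split(dataset, [160, 41])    # 160 + 41 = 201, not 203
except ValueError as err:
    mismatch_error = str(err)
    print("ValueError:", mismatch_error)
```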

loss is NaN or Inf

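Your learning rate may be too high, or your inputs may not be normalized. Try reducing lr (for example to 0.0001), or standardize each feature with <code>X = (X - X.mean(dim=0)) / X.std(dim=0)</code> before creating the dataset, ideally computing the mean and standard deviation from the training split only so no validation information leaks into training.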
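Here is a hedged sketch of normalizing with training-split statistics only, so the validation set stays untouched by anything learned from it. The badly scaled toy data and the variable names are assumptions made for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(0)
X = torch.randn(200, 10) * 50 + 100     # deliberately badly scaled features
y = torch.randint(0, 2, (200,))

train_ds, val_ds = random_split(TensorDataset(X, y), [160, 40])

# Per-feature statistics from the training rows only
train_X = X[train_ds.indices]
mean = train_X.mean(dim=0, keepdim=True)
std = train_X.std(dim=0, keepdim=True).clamp_min(1e-8)   # guard against zero std

# Apply the same training-derived transform to both splits
X_norm = (X - mean) / std
train_norm = TensorDataset(X_norm[train_ds.indices], y[train_ds.indices])
val_norm = TensorDataset(X_norm[val_ds.indices], y[val_ds.indices])

print(train_norm.tensors[0].mean(dim=0).abs().max().item() < 1e-3)  # True
```

The validation features will not have exactly zero mean and unit variance after this, which is expected: they are transformed with the training statistics, just as new data would be.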

Experienced dev note

In production, the validation set size and split ratio matter more than you think. A 70/30 split can work for 1M samples but will give you a tiny validation set if you only have 500 samples: 150 samples is borderline too small to trust. Also, avoid shuffle=True in your validation loader; it doesn't affect correctness, but it makes debugging harder when you need to trace which exact sample caused a bad prediction. And if you have severe class imbalance, use a stratified split, for example scikit-learn's train_test_split with the stratify argument (from sklearn.model_selection import train_test_split), to ensure both train and val have the same class distribution, rather than PyTorch's plain random split.
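A stratified split can be combined with PyTorch datasets by splitting indices with scikit-learn and wrapping them in Subset. This is a sketch that assumes scikit-learn is installed; the 90/10 imbalanced toy labels are made up for the example.

```python
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

torch.manual_seed(0)
X = torch.randn(200, 10)
# Imbalanced toy labels: 180 samples of class 0, 20 of class 1
y = torch.cat([torch.zeros(180, dtype=torch.long), torch.ones(20, dtype=torch.long)])
dataset = TensorDataset(X, y)

# Split the indices, not the tensors, keeping class proportions equal in both parts
train_idx, val_idx = train_test_split(
    list(range(len(dataset))),
    test_size=0.2,
    stratify=y.numpy(),
    random_state=42,
)
train_dataset = Subset(dataset, train_idx)
val_dataset = Subset(dataset, val_idx)

print("positives in train:", int(y[train_idx].sum()), "of", len(train_idx))
print("positives in val:  ", int(y[val_idx].sum()), "of", len(val_idx))
```

Both subsets can then be passed to DataLoader exactly like the output of random_split().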

Check your understanding

If your training loss decreases steadily but validation loss is erratic and noisy from epoch to epoch, what is the most likely cause and how would you diagnose it?

Show answer hint

The answer requires recognizing that erratic validation loss suggests dropout or batch norm is still in training mode during validation, which happens when <code>model.eval()</code> is not called. To diagnose it, verify that <code>model.eval()</code> runs before the validation loop (<code>model.training</code> should be <code>False</code> inside it) and check that the validation set and batch size are large enough to give a stable average. If the training loss is erratic too, suspect a learning rate that's too high.

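VERSION The volatile=True flag for inference was deprecated back in PyTorch 0.4. Use torch.no_grad() instead, which is the modern pattern (torch.inference_mode(), added in 1.9, is a stricter alternative). torch.utils.data.random_split() has kept the same basic call signature across the 1.x and 2.x releases; since 1.13 it also accepts fractions such as [0.8, 0.2] in place of absolute lengths.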

Once you nail the train/validation split, learn about <strong>early stopping</strong>: automatically stopping training when validation loss stops improving, rather than guessing the right number of epochs.
