Code Intermediate medium · 8 min

Multi-GPU training with DataParallel

What you will learn
DataParallel replicates your model on every GPU and splits each batch across them automatically, but you need to understand the cost of funneling outputs and gradients through one primary GPU.

Why this matters

Single-GPU training hits memory and speed limits fast. DataParallel is the simplest way to use multiple GPUs on a single machine, but many developers wrap their model and then wonder why speedup is only 1.3x on 4 GPUs. This lesson explains why that happens and when DataParallel is still the right tool.

Skip if: (1) your GPUs are on different machines (use DistributedDataParallel instead), (2) your training is communication-bound (DistributedDataParallel overlaps gradient communication with the backward pass), (3) you're already on PyTorch 2.3+ and your model compiles well (torch.compile on a single GPU may give more speedup than DataParallel across several), or (4) your batch size is so constrained that splitting it leaves each GPU with an impractically small slice.

Explanation

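DataParallel wraps your model, replicates it on each GPU, splits each batch across the GPUs, and combines the resulting gradients during the backward pass.

Mechanically: you move your model to the primary GPU and wrap it with nn.DataParallel(model). On each forward pass, DataParallel copies the current parameters to every GPU, scatters the input batch along its first dimension (one chunk for GPU 0, one for GPU 1, and so on), runs the replicas in parallel threads, and gathers the outputs back onto the primary GPU, where they are concatenated. On backward, each replica's gradients are reduced (summed) onto the primary GPU's parameters. Because the loss is computed over the gathered outputs, the result in the usual case matches the gradient of the full batch. The optimizer then steps once, on the primary GPU's parameters.

When to use: single-machine multi-GPU training with models that fit in the memory of one GPU. For training across machines, or when this per-step overhead dominates, use DistributedDataParallel instead. It is still synchronous, but it runs one process per GPU and uses an all-reduce that overlaps with the backward pass, so no single GPU acts as the hub.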
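The split-then-combine step can be checked with a toy calculation. The sketch below uses plain Python (no PyTorch, no GPUs) and a hypothetical one-parameter model y = w * x with mean-squared-error loss. `grad_mse` and `chunk` are illustrative helpers, not PyTorch APIs. It suggests why splitting a batch across replicas and combining their gradients, weighted by chunk size, reproduces the full-batch gradient.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def chunk(seq, n_chunks):
    """Split seq into contiguous pieces of size ceil(len(seq) / n_chunks)."""
    size = math.ceil(len(seq) / n_chunks)
    return [seq[i:i + size] for i in range(0, len(seq), size)]

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # targets generated by a "true" weight of 2.0

# Full-batch gradient, as a single device would compute it.
full_grad = grad_mse(w, xs, ys)

# Simulate 4 "GPUs": each replica sees one chunk and computes its own gradient.
x_chunks, y_chunks = chunk(xs, 4), chunk(ys, 4)
replica_grads = [grad_mse(w, xc, yc) for xc, yc in zip(x_chunks, y_chunks)]

# Combine the replica gradients, weighting each by its share of the batch.
combined = sum(g * len(xc) for g, xc in zip(replica_grads, x_chunks)) / len(xs)

print(full_grad, combined)  # both are -76.5
```

In DataParallel itself this weighting is not written out by hand: the loss is computed once on the gathered outputs, and autograd sums each replica's contribution into the original parameters.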

Analogy

Think of DataParallel like having multiple cashiers in one store (single machine). You give each cashier a portion of customers (batch), they process in parallel, then they meet at one desk to reconcile totals before the manager (optimizer) books the day. The reconciliation happens at one desk, and everyone waits for it: that's your bottleneck. DistributedDataParallel is like multiple stores, each with its own manager: the managers swap totals directly with one another while the tills are still running, so nobody queues at a single desk. Every store still ends the day with the same books.

Code

python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

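# Pick the primary device; DataParallel gathers outputs on the first GPU by default.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
num_gpus = torch.cuda.device_count()

print(f'Available GPUs: {num_gpus}')

if num_gpus == 0:
    print('Warning: No GPUs detected. Code will run on CPU.')
    device = torch.device('cpu')  # keep device a torch.device, not a str
    num_gpus = 1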

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)

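model = SimpleModel()
model = model.to(device)  # move to the primary device BEFORE wrapping

# device is a torch.device, so compare its .type; comparing the object
# itself to the string 'cuda' is False and would skip DataParallel.
if num_gpus > 1 and device.type == 'cuda':
    model = nn.DataParallel(model, device_ids=list(range(num_gpus)))
    print(f'Model wrapped with DataParallel on {num_gpus} GPUs')
else:
    print(f'Running on single device: {device}')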

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

X_train = torch.randn(64, 128)
y_train = torch.randint(0, 10, (64,))
dataset = TensorDataset(X_train, y_train)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

epochs = 2
for epoch in range(epochs):
    total_loss = 0
    for batch_idx, (X_batch, y_batch) in enumerate(dataloader):
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    avg_loss = total_loss / len(dataloader)
    print(f'Epoch {epoch + 1}, Avg Loss: {avg_loss:.4f}')

print('Training complete')
Output
Available GPUs: 0
Warning: No GPUs detected. Code will run on CPU.
Running on single device: cpu
Epoch 1, Avg Loss: 2.3018
Epoch 2, Avg Loss: 2.2961
Training complete

What just happened?

The code instantiated a simple 3-layer neural network, moved it to GPU (if available), wrapped it with DataParallel to enable multi-GPU training (on systems with multiple GPUs), then trained it for 2 epochs on a synthetic dataset. On a multi-GPU system, DataParallel would have split each batch of 16 across the available GPUs and combined the gradients. In the run shown, no GPU was detected, so the script fell back to CPU and ran the model normally without DataParallel.
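To make the batch split concrete, the helper below estimates how many samples each GPU would receive. It is a plain-Python sketch: `per_gpu_batch_sizes` is a hypothetical name, and it assumes DataParallel's scatter follows the same ceil-division rule as `torch.chunk` along the batch dimension.

```python
import math

def per_gpu_batch_sizes(batch_size, num_gpus):
    """Slice sizes per GPU, assuming torch.chunk-style (ceil-division) splitting."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

print(per_gpu_batch_sizes(16, 4))  # [4, 4, 4, 4]: even split
print(per_gpu_batch_sizes(16, 3))  # [6, 6, 4]: the last GPU gets a smaller slice
print(per_gpu_batch_sizes(2, 4))   # [1, 1]: fewer samples than GPUs, two GPUs idle
```

With the batch_size=16 used in the example, each of 4 GPUs would process only 4 samples per step, which is one reason small batches rarely benefit from DataParallel.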

Common gotcha

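The most common mistake is wrapping the model without first moving it to the primary GPU: call model.to(device) BEFORE nn.DataParallel(model). Another gotcha: DataParallel reduces gradients onto GPU 0 and gathers outputs there, making GPU 0 both a memory and a throughput bottleneck. If you print inside forward(), the message appears once per GPU, and each replica sees only its own slice of the batch on its own device, so any tensor you create inside forward() should use x.device rather than a hard-coded device. Also, a state_dict() taken from a DataParallel model carries a 'module.' prefix on every key. Save model.module.state_dict() instead, or strip the prefix when loading into an unwrapped model. Do not reach for model.load_state_dict(checkpoint, strict=False): it suppresses the error but silently skips every mismatched key, leaving the model with its initial weights.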

Error recovery

RuntimeError: Expected all tensors to be on the same device
Input batch and model are on different devices. Ensure X_batch and y_batch are moved to the same device as model with .to(device).
RuntimeError: CUDA out of memory
Batch size is too large for your GPUs. DataParallel splits the batch, so each GPU holds activations only for its slice, but every GPU still stores a full copy of the model, and GPU 0 additionally holds the gathered outputs and the optimizer state. Expect GPU 0 to run out of memory first. Reduce batch_size or model size.
AttributeError: 'DataParallel' object has no attribute 'fc1'
You are accessing model.fc1 directly, but DataParallel wraps your model. Access it via model.module.fc1 instead. Or save the unwrapped model reference before wrapping.
RuntimeError: Error(s) in loading state_dict (Missing key(s) / Unexpected key(s))
The saved checkpoint has 'module.' prefixes but your model is not wrapped with DataParallel (or vice versa). Either wrap/unwrap the model to match the checkpoint structure, or manually rename keys by removing 'module.' before loading.
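The manual key rename for the state_dict error can be sketched as follows. `strip_module_prefix` is a hypothetical helper, not a PyTorch function, and the checkpoint here is a stand-in with string values so the example runs without PyTorch. A real checkpoint would map the same kind of keys to tensors.

```python
from collections import OrderedDict

def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned

# Stand-in for a checkpoint saved from a DataParallel-wrapped SimpleModel.
checkpoint = OrderedDict([
    ("module.fc1.weight", "W1"),
    ("module.fc1.bias", "b1"),
    ("module.fc3.weight", "W3"),
])

cleaned = strip_module_prefix(checkpoint)
print(list(cleaned))  # ['fc1.weight', 'fc1.bias', 'fc3.weight']
```

With a real checkpoint, the result would be passed to model.load_state_dict(cleaned) on the unwrapped model. Keys without the prefix pass through unchanged, so the helper is safe to apply to either kind of checkpoint.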

Experienced dev note

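DataParallel looks simple but has a hidden cost: on every step it re-broadcasts the parameters from GPU 0, scatters inputs from GPU 0, gathers outputs on GPU 0, and reduces gradients back onto GPU 0, all from a single Python process whose worker threads contend for the GIL. This creates a serialization bottleneck that gets worse the more GPUs you add. On a single machine with 4 GPUs, you might see only 2–3x speedup instead of 4x, and less for small models where the overhead dominates. If you're hitting this wall, switch to DistributedDataParallel, which runs one process per GPU and overlaps its all-reduce with the backward pass. torch.compile with the inductor backend is a separate, complementary optimization: it speeds up each GPU's own work by fusing operations but does not remove the GPU 0 bottleneck. Also, if you're using mixed precision with DataParallel, a single GradScaler in the training loop should be enough, but check the AMP documentation for your PyTorch version on how autocast interacts with DataParallel's worker threads.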
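One rough way to reason about the sub-linear speedup is Amdahl's law. The sketch below is a back-of-the-envelope model, not a measurement: it assumes a fixed fraction of each training step cannot be parallelized. `serial_fraction` is a hypothetical parameter standing in for the work that passes through GPU 0, and the values tried are illustrative. Since real overhead tends to grow with GPU count, this estimate is likely optimistic.

```python
def amdahl_speedup(serial_fraction, num_gpus):
    """Ideal speedup when serial_fraction of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_gpus)

# Illustrative serial fractions; measure your own step to get a real value.
for s in (0.0, 0.1, 0.25, 0.5):
    print(f"serial={s:.2f}  speedup on 4 GPUs: {amdahl_speedup(s, 4):.2f}x")
```

Under this model, a step that is 25% serialized tops out near 2.3x on 4 GPUs, and one that is 50% serialized reaches only 1.6x, no matter how fast the individual GPUs are.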

Check your understanding

Why does adding a fourth GPU with DataParallel sometimes give only 1.2x speedup instead of 4x, and what architectural change would fix it?

Show answer hint

The answer must mention GPU 0 as the bottleneck (parameter broadcast, output gather, and gradient reduction all pass through it) and identify DistributedDataParallel, with its one-process-per-GPU design and all-reduce overlapped with the backward pass, as a solution. Simply saying 'communication overhead' is incomplete: the learner needs to understand that DataParallel funnels everything through the primary GPU, while DistributedDataParallel spreads the communication across all GPUs. Both are synchronous.

VERSION DataParallel has been stable since PyTorch 0.4.0 and remains unchanged in PyTorch 2.11.x. However, PyTorch 2.0+ emphasizes torch.compile() and DistributedDataParallel for new projects. DataParallel is not deprecated, but the PyTorch documentation recommends DistributedDataParallel over it for multi-GPU training, even on a single machine.
NEXT

Master distributed training with DistributedDataParallel, which eliminates the single-GPU bottleneck by distributing gradient averaging across all GPUs and supporting multi-machine setups.
