Optimizers: torch.optim.Adam, SGD
Why this matters
Without an optimizer, you have gradients but no mechanism to actually improve your model. Choosing between Adam and SGD affects both training speed and final model quality: this is a decision you make in every training loop.
Explanation
What it is: An optimizer is an algorithm that updates your model's weights based on computed gradients. PyTorch provides several optimizers in torch.optim; Adam and SGD are the two most common.
How it works: After you call loss.backward(), each parameter's gradient is stored in param.grad. Calling optimizer.step() reads these gradients and updates the weights, and the two optimizers differ in how they do it. SGD (Stochastic Gradient Descent) subtracts the gradient multiplied by the learning rate. Adam (Adaptive Moment Estimation) maintains running averages of the gradients and of the squared gradients, and uses them to adapt the step size per parameter.
When to use which: Start with Adam for most problems: it usually needs less learning-rate tuning and often converges in fewer steps. Use SGD when you have a well-tuned learning rate schedule or when Adam overshoots (rare, but it happens with some architectures).
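To make the two update strategies concrete, here is a minimal sketch of one SGD step and one Adam step for a single scalar weight, written in plain Python rather than with torch.optim. The function names sgd_step and adam_step and the state dictionary layout are made up for illustration; the constants beta1=0.9, beta2=0.999 and eps=1e-8 are the defaults documented for torch.optim.Adam. The sketch leaves out options such as momentum for SGD and weight decay.

```python
import math

def sgd_step(w, grad, lr):
    """Plain SGD: move against the gradient, scaled by the learning rate."""
    return w - lr * grad

def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a scalar weight; `state` holds the running averages."""
    state["t"] += 1
    # Running average of gradients (first moment)
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # Running average of squared gradients (second moment)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction: both averages start at zero, so early values are too small
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

w0, lr = 1.0, 0.01
for grad in (0.001, 1.0, 1000.0):
    sgd_delta = sgd_step(w0, grad, lr) - w0
    # Fresh state each time, so every call below is a "first step"
    adam_delta = adam_step(w0, grad, {"t": 0, "m": 0.0, "v": 0.0}, lr) - w0
    print(f"grad={grad:>8}: SGD moves {sgd_delta:+.6f}, Adam moves {adam_delta:+.6f}")
```

In this sketch the SGD step grows in proportion to the gradient, while Adam's first step is close to lr in size whatever the gradient's magnitude, because the gradient is divided by the square root of its own squared average.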
Analogy
Imagine descending a foggy mountain. SGD is like taking steps downhill whose length depends only on how steep the ground is right where you stand. Adam is like adjusting each step based on how steep the terrain has been recently: consistently steep paths get smaller steps, flat paths get larger ones.
Code
import torch
import torch.nn as nn
import torch.optim as optim

# Create a simple model
model = nn.Linear(10, 1)

# Create sample data
X = torch.randn(32, 10)
y = torch.randn(32, 1)

# Define loss function
loss_fn = nn.MSELoss()

# Example 1: Using Adam optimizer
print("=== Adam Optimizer ===")
adam_optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(3):
    adam_optimizer.zero_grad()   # clear gradients from the previous step
    y_pred = model(X)            # forward pass
    loss = loss_fn(y_pred, y)
    loss.backward()              # compute gradients
    adam_optimizer.step()        # update weights
    print(f"Epoch {epoch + 1}, Loss: {loss.item():.6f}")

# Example 2: Using SGD optimizer
print("\n=== SGD Optimizer ===")
# Fresh model with new random weights, so its starting loss differs from the Adam run
model = nn.Linear(10, 1)
sgd_optimizer = optim.SGD(model.parameters(), lr=0.01)
for epoch in range(3):
    sgd_optimizer.zero_grad()
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    loss.backward()
    sgd_optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item():.6f}")

# Example 3: Inspect a single update step
print("\n=== What each step does ===")
model = nn.Linear(2, 1)
# Plain SGD here so the update can be checked by hand: new = old - lr * grad
optimizer = optim.SGD(model.parameters(), lr=0.01)
X_small = torch.randn(4, 2)
y_small = torch.randn(4, 1)
print(f"Before update: {model.weight.data}")
optimizer.zero_grad()
y_pred = model(X_small)
loss = loss_fn(y_pred, y_small)   # reuses loss_fn defined above
loss.backward()
print(f"Gradients: {model.weight.grad}")
optimizer.step()
print(f"After update: {model.weight.data}")
Output (exact numbers vary from run to run because the data and initial weights are random)
=== Adam Optimizer ===
Epoch 1, Loss: 0.878351
Epoch 2, Loss: 0.873852
Epoch 3, Loss: 0.869458

=== SGD Optimizer ===
Epoch 1, Loss: 1.082334
Epoch 2, Loss: 1.043721
Epoch 3, Loss: 1.006892

=== What each step does ===
Before update: tensor([[-0.2847, -0.3891]])
Gradients: tensor([[-0.0412, -0.0156]])
After update: tensor([[-0.2843, -0.3889]])
What just happened?
We created a linear model and trained it for 3 epochs with each of two optimizers. With Adam, we initialized the optimizer with the model's parameters and a learning rate of 0.001. Each training step (1) zeroed the gradients from the previous step, (2) computed predictions, (3) calculated the loss, (4) backpropagated to compute gradients, and (5) called optimizer.step(), which applied Adam's adaptive update to the weights. We repeated this with SGD on a freshly initialized model, using a higher learning rate (0.01) because plain SGD multiplies the raw gradient by the learning rate and typically needs a larger value than Adam's default to make visible progress. The fresh model starts from different random weights, so the losses of the two runs are not directly comparable. In the final example, we printed the weight tensor before and after one update step to show that the optimizer moved the weights by a small amount in the direction opposite to the gradient.
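Because each call to nn.Linear(10, 1) draws new random weights, the two training runs start from different losses. One way to put the optimizers on equal footing, sketched here, is to copy a single starting model with copy.deepcopy so both runs begin from identical weights. The helper name train, the variable names, and the seed value are illustrative choices, not part of the lesson's code.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed, only so reruns use the same data

X = torch.randn(32, 10)
y = torch.randn(32, 1)
loss_fn = nn.MSELoss()
base_model = nn.Linear(10, 1)

def train(model, optimizer, steps=3):
    """Run a few full-batch steps and return the loss seen at each step."""
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

# Each optimizer gets its own copy of the same starting weights
adam_model = copy.deepcopy(base_model)
sgd_model = copy.deepcopy(base_model)
adam_losses = train(adam_model, optim.Adam(adam_model.parameters(), lr=0.001))
sgd_losses = train(sgd_model, optim.SGD(sgd_model.parameters(), lr=0.01))

print("Adam:", [f"{value:.6f}" for value in adam_losses])
print("SGD: ", [f"{value:.6f}" for value in sgd_losses])
```

The first loss in each list is computed before any update, so it is the same for both runs; differences in the later entries come from the optimizer and its learning rate alone.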
Common gotcha
Forgetting to call optimizer.zero_grad() before loss.backward() causes gradients to accumulate across iterations: each backward pass adds to param.grad instead of replacing it. Your second epoch's update then uses the sum of both epochs' gradients, and the steps drift further from the true gradient with every iteration. This is a silent error: no exception is raised; your model just trains poorly.
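The accumulation is easy to observe directly. The following sketch calls backward() twice on the same loss without clearing gradients and compares the stored gradient after each call; the variable names and the seed are arbitrary choices for illustration, and no optimizer step is taken so the weights stay fixed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for repeatable numbers
model = nn.Linear(2, 1)
X = torch.randn(4, 2)
y = torch.randn(4, 1)
loss_fn = nn.MSELoss()

# First backward pass: .grad holds the gradient of this loss
loss_fn(model(X), y).backward()
grad_once = model.weight.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one
loss_fn(model(X), y).backward()
grad_twice = model.weight.grad.clone()

# The weights did not change, so the same gradient was added twice
print("doubled:", torch.allclose(grad_twice, 2 * grad_once))

# Clearing the gradients first restores the expected value
model.zero_grad()
loss_fn(model(X), y).backward()
print("restored:", torch.allclose(model.weight.grad, grad_once))
```

In a real training loop the weights change between steps, so the accumulated value is a sum of different gradients rather than a clean doubling, which makes the bug harder to spot.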
Error recovery
RuntimeError: expected scalar type Half but found Float
AttributeError: 'NoneType' object has no attribute 'data'
RuntimeError: step() called before any gradients were computed
Experienced dev note
Adam's adaptive learning rate is powerful but can hide a tuning mistake: if your model isn't learning, resist the urge to increase the learning rate immediately. Instead, check that gradients are actually reaching your parameters (param.grad should not be None after loss.backward(), and loss.item() should change between steps) and that your loss function matches your task. A learning rate of 0.001, Adam's default, is usually a safe starting point; in practice, problems come from mismatched loss functions more often than from optimizer tuning. Also: Adam uses more memory than SGD because it maintains two state tensors per parameter (running averages of the gradient and of the squared gradient). Plain SGD keeps none, and SGD with momentum keeps one. On very large models, this matters.
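The memory difference can be checked by inspecting optimizer.state after a single step. This is a rough sketch: state_elements and one_step are hypothetical helpers written for this example, and the sketch counts only state tensors that have the same shape as their parameter, so that bookkeeping entries such as a step counter are ignored.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def state_elements(optimizer):
    """Count elements in per-parameter state tensors shaped like their parameter."""
    total = 0
    for param, state in optimizer.state.items():
        for value in state.values():
            if torch.is_tensor(value) and value.shape == param.shape:
                total += value.numel()
    return total

def one_step(make_optimizer):
    """Build a fresh model, take one step, and report parameter and state sizes."""
    model = nn.Linear(10, 1)
    optimizer = make_optimizer(model.parameters())
    loss = model(torch.randn(32, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()
    n_params = sum(p.numel() for p in model.parameters())
    return n_params, state_elements(optimizer)

results = {
    "SGD": one_step(lambda params: optim.SGD(params, lr=0.01)),
    "SGD + momentum": one_step(lambda params: optim.SGD(params, lr=0.01, momentum=0.9)),
    "Adam": one_step(lambda params: optim.Adam(params, lr=0.001)),
}
for name, (n_params, n_state) in results.items():
    print(f"{name}: {n_params} parameters, {n_state} state elements")
```

For this 11-parameter model, plain SGD should report no state, SGD with momentum one extra element per parameter, and Adam two. The same ratios apply to a model with billions of parameters, which is where the extra memory becomes a real cost.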
Check your understanding
If you initialize an Adam optimizer with lr=0.01, train for one epoch, then create a new model with the same architecture and initialize a fresh Adam optimizer with lr=0.001, what happens to the second model's training speed and why? (Hint: think about what information is carried between optimizer and model.)
Answer hint
A correct answer recognizes that the optimizer state (momentum/velocity) is separate from the model weights: creating a new optimizer doesn't carry over any learning history. The second model starts fresh with no accumulated momentum, so the lower learning rate acts like training from scratch with a smaller step size. The learning rate difference matters more than the lack of momentum in early epochs.
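The separation between optimizer state and model weights can be seen in a few lines. In this sketch, train_one_step and the variable names are illustrative; the only claim is that an optimizer holds history for the parameters it was given and only after it has taken a step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_one_step(model, optimizer):
    """Take a single optimizer step on random input (illustrative helper)."""
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()

first_model = nn.Linear(10, 1)
first_optimizer = optim.Adam(first_model.parameters(), lr=0.01)
train_one_step(first_model, first_optimizer)

# A new model and a new optimizer: nothing is shared with the first pair
second_model = nn.Linear(10, 1)
second_optimizer = optim.Adam(second_model.parameters(), lr=0.001)

# The first optimizer has per-parameter history; the new one has none yet
print("first optimizer tracks", len(first_optimizer.state), "parameters")
print("second optimizer tracks", len(second_optimizer.state), "parameters")
print("second optimizer lr:", second_optimizer.param_groups[0]["lr"])
```

The second optimizer's state stays empty until its own first call to step(), so the second model's early progress is governed by its own learning rate and fresh running averages.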