Code Beginner easy · 5 min

Optimizers: torch.optim.Adam, SGD

What you will learn

Optimizers update model weights during training by following the gradient signal in different ways.

Why this matters

Without an optimizer, you have gradients but no mechanism to actually improve your model. Choosing between Adam and SGD affects both training speed and final model quality: this is a decision you make in every training loop.

Skip if: You only run inference, or you use a frozen pre-trained model purely for feature extraction. Optimizers are needed only during training, never for prediction.

Explanation

What it is: An optimizer is an algorithm that updates your model's weights based on computed gradients. PyTorch provides several optimizers in torch.optim; Adam and SGD are two of the most common.

How it works: After you call loss.backward(), each parameter's gradient is stored in param.grad. The optimizer reads these gradients and updates the weights, and the two optimizers differ in how they do it. SGD (Stochastic Gradient Descent) subtracts the gradient multiplied by a single learning rate. Adam (Adaptive Moment Estimation) maintains running averages of the gradients and of the squared gradients, and uses them to adapt the step size per parameter.

When to use which: Start with Adam for most problems: it usually needs less learning-rate tuning and often converges in fewer steps. Use SGD when you have a well-tuned learning rate schedule, or when Adam overshoots or trains unstably (rare, but it happens with some architectures).
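The two update rules described above can be sketched by hand and compared against torch.optim. This is a minimal illustration, not PyTorch's actual implementation: it covers only the first step, assumes Adam's default hyperparameters (betas of 0.9 and 0.999, eps of 1e-8), and uses a made-up parameter tensor w0 and gradient g.

```python
import torch

torch.manual_seed(0)

# A single parameter tensor and a made-up gradient, for illustration only.
w0 = torch.randn(5)
g = torch.randn(5)
lr = 0.01

# SGD by hand: w <- w - lr * grad
w_sgd_manual = w0 - lr * g

# Adam by hand (first step, assumed default betas and eps)
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = (1 - beta1) * g              # running average of gradients
v = (1 - beta2) * g * g          # running average of squared gradients
m_hat = m / (1 - beta1 ** 1)     # bias correction for step 1
v_hat = v / (1 - beta2 ** 1)
w_adam_manual = w0 - lr * m_hat / (v_hat.sqrt() + eps)

def one_step(opt_cls):
    """Apply one optimizer step to a copy of w0 using the fixed gradient g."""
    p = torch.nn.Parameter(w0.clone())
    opt = opt_cls([p], lr=lr)
    p.grad = g.clone()
    opt.step()
    return p.detach()

w_sgd = one_step(torch.optim.SGD)
w_adam = one_step(torch.optim.Adam)

print(torch.allclose(w_sgd, w_sgd_manual))
print(torch.allclose(w_adam, w_adam_manual, atol=1e-6))
print((w_sgd - w0).abs())   # proportional to each gradient's magnitude
print((w_adam - w0).abs())  # each entry is close to lr
```

On this first step, SGD's change scales with each gradient entry, while Adam's change is close to lr for every entry regardless of the gradient's size.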

Analogy

Imagine descending a foggy mountain. SGD is like stepping downhill in proportion to the slope under your feet: the steeper the ground, the longer the step, with one fixed scale for every direction. Adam is like adjusting your stride based on how steep the terrain has been recently: directions that have been consistently steep get smaller steps, flat ones get larger steps.

Code

python

import torch
import torch.nn as nn
import torch.optim as optim

# Create a simple model
model = nn.Linear(10, 1)

# Create sample data
X = torch.randn(32, 10)
y = torch.randn(32, 1)

# Define loss function
loss_fn = nn.MSELoss()

# Example 1: Using Adam optimizer
print("=== Adam Optimizer ===")
adam_optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(3):
    adam_optimizer.zero_grad()
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    loss.backward()
    adam_optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item():.6f}")

print("\n=== SGD Optimizer ===")
# Re-create the model so SGD starts from fresh (but different) random weights
model = nn.Linear(10, 1)
sgd_optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(3):
    sgd_optimizer.zero_grad()
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    loss.backward()
    sgd_optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item():.6f}")

print("\n=== What each step does ===")
model = nn.Linear(2, 1)
optimizer = optim.Adam(model.parameters(), lr=0.01)
X_small = torch.randn(4, 2)
y_small = torch.randn(4, 1)
loss_fn = nn.MSELoss()

print(f"Before update: {model.weight.data}")
optimizer.zero_grad()
y_pred = model(X_small)
loss = loss_fn(y_pred, y_small)
loss.backward()
print(f"Gradients: {model.weight.grad}")
optimizer.step()
print(f"After update: {model.weight.data}")

Output

=== Adam Optimizer ===
Epoch 1, Loss: 0.878351
Epoch 2, Loss: 0.873852
Epoch 3, Loss: 0.869458

=== SGD Optimizer ===
Epoch 1, Loss: 1.082334
Epoch 2, Loss: 1.043721
Epoch 3, Loss: 1.006892

=== What each step does ===
Before update: tensor([[-0.2847, -0.3891]])
Gradients: tensor([[-0.0412, -0.0156]])
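After update: tensor([[-0.2747, -0.3791]])

What just happened?

We created a linear model and trained it for 3 epochs with each of two optimizers. With Adam, we initialized the optimizer with the model's parameters and a learning rate of 0.001. Each training step (1) zeroed the gradients left over from the previous step, (2) computed predictions, (3) calculated the loss, (4) backpropagated to compute gradients, and (5) called optimizer.step() to apply Adam's adaptive update. We repeated this with SGD at a higher learning rate (0.01): plain SGD multiplies the raw gradient by the learning rate, so it often needs a larger value than Adam to make comparable progress. The two runs start from different random weights, so their loss values are not directly comparable, and your exact numbers will differ from run to run because no seed is set. In the final example, we printed the weight tensor before and after one update. Each weight moved by about 0.01 (the learning rate) in the direction opposite to its gradient's sign, even though the gradients themselves are much smaller: on its first step Adam normalizes each gradient by its own magnitude, whereas SGD would have moved each weight by only lr × gradient.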

Common gotcha

Forgetting to call optimizer.zero_grad() before loss.backward() causes gradients to accumulate across iterations. The second iteration's gradients are added to the first iteration's instead of replacing them, so the optimizer steps along a sum of stale gradients rather than the current one. This is a silent error: no exception is raised, your model just trains poorly.
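The accumulation can be observed directly. The sketch below uses a small made-up model and random data, runs backward() twice without clearing, and compares the stored gradient with the single-pass gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
X = torch.randn(8, 3)
y = torch.randn(8, 1)
loss_fn = nn.MSELoss()

# First backward pass: .grad holds the gradient of this loss.
loss_fn(model(X), y).backward()
first_grad = model.weight.grad.clone()

# Second backward pass on the same data WITHOUT zeroing: gradients add up.
loss_fn(model(X), y).backward()
accumulated = model.weight.grad.clone()
doubled = torch.allclose(accumulated, 2 * first_grad)

# Clearing first restores the expected single-pass gradient.
model.zero_grad()
loss_fn(model(X), y).backward()
restored = torch.allclose(model.weight.grad, first_grad)

print(doubled, restored)  # True True
```

Because the weights do not change between the passes, the second gradient equals the first, so the accumulated value is exactly double.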

Error recovery

RuntimeError: expected scalar type Half but found Float

This usually signals a dtype mismatch in the forward pass rather than a problem in the optimizer itself: for example, part of the model or its input was converted to half precision (float16) while the rest is still float32. Fix: either use the same precision throughout or, if you intend mixed precision, keep the model in float32 and wrap your forward pass with torch.amp.autocast('cuda').
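A minimal sketch of the autocast pattern follows. To stay runnable without a GPU it assumes device_type="cpu" with bfloat16; on a CUDA device you would pass device_type="cuda" instead. The model and data are made up for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # parameters stay in float32
X = torch.randn(32, 10)
y = torch.randn(32, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

optimizer.zero_grad()
# Only the forward pass runs under autocast (assumption: CPU + bfloat16 here).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_pred = model(X)
# Compute the loss in float32, outside the autocast region.
loss = nn.functional.mse_loss(y_pred.float(), y)
loss.backward()
optimizer.step()

print(y_pred.dtype)             # dtype of the activations produced under autocast
print(model.weight.dtype)       # dtype of the stored weights
print(model.weight.grad.dtype)  # dtype of the gradients the optimizer reads
```

The point of the pattern is that the optimizer only ever sees float32 weights and float32 gradients, even though the forward pass ran in lower precision.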

AttributeError: 'NoneType' object has no attribute 'data'

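Something that was expected to be a tensor is None. A common cause is reading param.grad.data before any backward() call, or for a parameter that did not contribute to the loss. A related mistake is passing a module to the optimizer instead of its parameters, for example model.features instead of model.features.parameters(); PyTorch typically rejects that with a TypeError when the optimizer is constructed. Fix: pass model.parameters() (or parameter groups built from model.named_parameters()) when initializing the optimizer, and check that param.grad is not None before reading it.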
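The sketch below shows the optimizer being given an iterable of parameters, both for a whole model and for one submodule, and what happens when a module is passed by mistake. The model layout and the names head, opt_all and opt_head are made up for illustration.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
head = model[2]  # the last Linear layer

# Correct: pass an iterable of parameters.
opt_all = optim.SGD(model.parameters(), lr=0.01)
opt_head = optim.SGD(head.parameters(), lr=0.01)

n_all = sum(len(group["params"]) for group in opt_all.param_groups)
n_head = sum(len(group["params"]) for group in opt_head.param_groups)
print(n_all, n_head)  # 4 2 (a weight and a bias per Linear layer)

# Incorrect: passing the module itself fails when the optimizer is built.
try:
    optim.SGD(head, lr=0.01)
    rejected = False
except TypeError as err:
    rejected = True
    print(type(err).__name__)
```

Passing head.parameters() is also how you would train only part of a model while leaving the rest untouched.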

optimizer.step() runs, but the weights never change

You called optimizer.step() without calling loss.backward() first. The optimizer has no gradient information to use: every param.grad is still None, and PyTorch's built-in optimizers skip such parameters rather than raising an error, so this fails silently. Fix: ensure your training loop runs zero_grad → forward → loss → backward → step, in exactly that order.
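One way to diagnose this condition is to inspect param.grad directly. The sketch below uses a hypothetical helper, has_grads, which is not part of PyTorch, and then runs one step in the order given above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
X, y = torch.randn(8, 3), torch.randn(8, 1)

def has_grads(module):
    """Hypothetical helper: True once every parameter has a gradient."""
    return all(p.grad is not None for p in module.parameters())

ready_before = has_grads(model)  # no backward() has run yet

before = model.weight.detach().clone()
optimizer.zero_grad()                       # 1. clear old gradients
loss = nn.functional.mse_loss(model(X), y)  # 2-3. forward pass and loss
loss.backward()                             # 4. compute gradients
ready_after = has_grads(model)
optimizer.step()                            # 5. update weights
changed = not torch.equal(model.weight.detach(), before)

print(ready_before, ready_after, changed)  # False True True
```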

Experienced dev note

Adam's adaptive learning rate is powerful but can hide a tuning mistake: if your model isn't learning, resist the urge to increase the learning rate immediately. Instead, check that gradients are actually being computed (print loss.item() each step to verify it changes, and confirm param.grad is not None after backward()) and that your loss function matches your task. A learning rate of 0.001, which is also PyTorch's default for Adam, is a reasonable starting point; in practice, a mismatched loss function is a more common culprit than optimizer tuning. Also: Adam uses more memory than plain SGD because it maintains two extra state tensors per parameter (running averages of the gradient and of the squared gradient), each the same shape as the parameter. On very large models, this matters.
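The memory difference can be checked by counting the tensors each optimizer keeps in its state after one step. This is a rough sketch: state_elements and run_one_step are hypothetical helpers, scalar step counters are ignored, and SGD is used without momentum.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def state_elements(optimizer):
    """Hypothetical helper: count elements in non-scalar state tensors."""
    total = 0
    for per_param in optimizer.state.values():
        for value in per_param.values():
            if torch.is_tensor(value) and value.dim() > 0:
                total += value.numel()
    return total

def run_one_step(opt_cls):
    """Train a fresh 10-to-1 linear model for one step; return counts."""
    model = nn.Linear(10, 1)
    optimizer = opt_cls(model.parameters(), lr=0.01)
    X, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    nn.functional.mse_loss(model(X), y).backward()
    optimizer.step()
    n_params = sum(p.numel() for p in model.parameters())
    return n_params, state_elements(optimizer)

n_params, adam_state = run_one_step(optim.Adam)
_, sgd_state = run_one_step(optim.SGD)

print(n_params)    # 11 (10 weights + 1 bias)
print(adam_state)  # 22: two state tensors per parameter
print(sgd_state)   # 0: plain SGD keeps no per-parameter tensors
```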

Check your understanding

If you initialize an Adam optimizer with lr=0.01, train for one epoch, then create a new model with the same architecture and initialize a fresh Adam optimizer with lr=0.001, what happens to the second model's training speed and why? (Hint: think about what information is carried between optimizer and model.)

Show answer hint

A correct answer recognizes that the optimizer state (Adam's running averages of gradients and squared gradients) is separate from the model weights and belongs to one optimizer instance: creating a new optimizer doesn't carry over any learning history. The second model starts fresh with no accumulated state, so it simply trains from scratch with a smaller step size. In early epochs, the learning rate difference matters more than the missing history.
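The separation between optimizer state and model weights can be seen by inspecting optimizer.state. The sketch below mirrors the scenario in the question with made-up small models; train_one_step is a hypothetical helper.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_one_step(model, optimizer):
    """Hypothetical helper: one training step on random data."""
    X, y = torch.randn(16, 4), torch.randn(16, 1)
    optimizer.zero_grad()
    nn.functional.mse_loss(model(X), y).backward()
    optimizer.step()

# First model: trained for one step with lr=0.01.
model_a = nn.Linear(4, 1)
opt_a = optim.Adam(model_a.parameters(), lr=0.01)
train_one_step(model_a, opt_a)

# Second model: same architecture, fresh optimizer with lr=0.001.
model_b = nn.Linear(4, 1)
opt_b = optim.Adam(model_b.parameters(), lr=0.001)

print(len(opt_a.state))             # 2: weight and bias now have state
print(len(opt_b.state))             # 0: nothing carried over
print(opt_b.param_groups[0]["lr"])  # 0.001
```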

VERSION PyTorch 2.11.x (March 2026) maintains backward compatibility with torch.optim.Adam and torch.optim.SGD APIs from 1.0.0 forward. No breaking changes. torch.compile() can optimize optimizer updates in some cases, but the API itself is stable.

Learning rates and schedulers: how to change the optimizer's step size during training to improve convergence.
