nn.ReLU, nn.Sigmoid: activation functions
Why this matters
Without activation functions, a stack of linear layers collapses into a single linear transformation, so extra depth adds no expressive power. Choosing the right activation directly affects training speed, convergence, and model accuracy.
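The collapse can be checked numerically. The sketch below is an illustration rather than part of the lesson's own code: the layer sizes, the seed, and the names `first`, `second`, and `merged` are arbitrary choices. It folds two stacked nn.Linear layers into one equivalent layer, then shows that the equivalence breaks once a ReLU sits between them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the check is repeatable

# Two Linear layers stacked with no activation in between
first = nn.Linear(4, 8)
second = nn.Linear(8, 2)
stacked = nn.Sequential(first, second)

# Fold both layers into one: W = W2 @ W1, b = W2 @ b1 + b2
merged = nn.Linear(4, 2)
with torch.no_grad():
    merged.weight.copy_(second.weight @ first.weight)
    merged.bias.copy_(second.weight @ first.bias + second.bias)

x = torch.randn(5, 4)
same_without_activation = torch.allclose(stacked(x), merged(x), atol=1e-5)

# Put a ReLU between the layers and the single merged layer no longer matches
with_relu = nn.Sequential(first, nn.ReLU(), second)
same_with_activation = torch.allclose(with_relu(x), merged(x), atol=1e-5)

print(f"No activation, same as one layer: {same_without_activation}")
print(f"With ReLU, same as one layer: {same_with_activation}")
```

The first check passes because composing two affine maps yields another affine map; the ReLU in the second model is what prevents that simplification.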
Explanation
Activation functions are non-linear transformations applied element-wise to a layer's output. They introduce the non-linearity that lets a neural network learn patterns a purely linear model cannot; describing a neuron as deciding to 'fire' or stay dormant is a loose metaphor for this. ReLU (Rectified Linear Unit) returns max(0, x): it passes positive values unchanged and sets negative ones to zero. Sigmoid squashes any real input into (0, 1) using the formula 1/(1 + e^-x), which makes its output readable as a probability in binary classification. Mechanically, nn.ReLU() and nn.Sigmoid() are modules with no learnable parameters; calling one in the forward pass applies the function to every element of the tensor and returns a tensor of the same shape. ReLU is the usual default for hidden layers because it is cheap to compute and its gradient is exactly 1 for positive inputs, which mitigates (but does not eliminate) vanishing gradients. Sigmoid was historically preferred but saturates: its gradient approaches zero for large positive or negative inputs, so it is now mostly reserved for the output layer of a binary classifier.
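To connect the formulas to the modules, here is a minimal sketch that writes both functions by hand and compares them with nn.ReLU and nn.Sigmoid. The names `manual_relu` and `manual_sigmoid` are illustrative only.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Hand-written versions of the two formulas
manual_relu = torch.clamp(x, min=0)       # max(0, x)
manual_sigmoid = 1 / (1 + torch.exp(-x))  # 1 / (1 + e^-x)

# The nn modules compute the same values element-wise
relu_matches = torch.equal(nn.ReLU()(x), manual_relu)
sigmoid_matches = torch.allclose(nn.Sigmoid()(x), manual_sigmoid)

# Neither module holds any learnable parameters
relu_param_count = len(list(nn.ReLU().parameters()))

print(relu_matches, sigmoid_matches, relu_param_count)
```

In practice you would use the built-in modules (or torch.relu and torch.sigmoid); the hand-written forms are only there to make the math visible.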
Analogy
Think of ReLU as a gating mechanism: it's either fully open (passes signal) or fully closed (blocks signal). Sigmoid is like a dimmer switch that gradually transitions from off to on, useful when you need a probability that smoothly ranges from impossible to certain.
Code
import torch
import torch.nn as nn

# Activations are modules: instantiate once, then call them like functions
relu = nn.ReLU()
sigmoid = nn.Sigmoid()

input_tensor = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
relu_output = relu(input_tensor)        # element-wise max(0, x)
sigmoid_output = sigmoid(input_tensor)  # element-wise 1 / (1 + e^-x)

print(f"Input: {input_tensor}")
print(f"ReLU output: {relu_output}")
print(f"Sigmoid output: {sigmoid_output}")

# Linear -> ReLU -> Linear -> Sigmoid: 5 input features down to 1 probability
model = nn.Sequential(
    nn.Linear(5, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.Sigmoid(),
)

# Weights and input are random, so the printed probabilities differ per run
sample_input = torch.randn(2, 5)  # batch of 2 samples, 5 features each
output = model(sample_input)
print(f"\nModel output shape: {output.shape}")
print(f"Model output (probabilities): {output}")

Input: tensor([-2.0000, -0.5000,  0.0000,  0.5000,  2.0000])
ReLU output: tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
Sigmoid output: tensor([0.1192, 0.3775, 0.5000, 0.6225, 0.8808])

Model output shape: torch.Size([2, 1])
Model output (probabilities): tensor([[0.5234],
        [0.4891]])
What just happened?
The code instantiated the ReLU and Sigmoid activation functions as layers. When passed a tensor of 5 values, ReLU zeroed out the negatives and kept the positives unchanged, while Sigmoid compressed all values into the (0, 1) range. Then a two-layer sequential model was built as Linear → ReLU → Linear → Sigmoid, which transformed a batch of 2 samples from 5 dimensions down to 1 dimension with sigmoid-squashed outputs between 0 and 1. Because the weights and the input are random, the exact probabilities differ on every run.
Common gotcha
A common mistake is applying sigmoid twice on the way to the loss. If your criterion is BCEWithLogitsLoss and you write `loss = criterion(sigmoid(model_output), targets)`, the sigmoid runs twice, because BCEWithLogitsLoss already applies it internally and expects raw logits. BCELoss is the opposite: it expects probabilities that have already been through sigmoid, so it pairs with a model that ends in nn.Sigmoid(). Always check your loss function's documentation: some expect raw logits, others expect activated outputs.
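The difference is easy to see on a handful of values. The following sketch uses made-up logits and targets (the seed and variable names are illustrative assumptions) and compares the two correct pairings against the double-sigmoid bug.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(4, 1)  # raw model outputs, no sigmoid applied
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# Option A: raw logits + BCEWithLogitsLoss (sigmoid happens inside the loss)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Option B: probabilities + BCELoss (you apply sigmoid yourself, once)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# The bug: sigmoid applied by you AND again inside the loss
loss_double_sigmoid = nn.BCEWithLogitsLoss()(torch.sigmoid(logits), targets)

correct_pairings_agree = torch.allclose(loss_from_logits, loss_from_probs, atol=1e-5)
bug_changes_loss = not torch.allclose(loss_from_logits, loss_double_sigmoid, atol=1e-5)

print(f"A and B agree: {correct_pairings_agree}")
print(f"Double sigmoid gives a different loss: {bug_changes_loss}")
```

Options A and B are mathematically the same loss. The double-sigmoid version raises no error, it just trains against the wrong quantity, which is why this bug tends to go unnoticed.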
Error recovery
RuntimeError: Expected all tensors to be on the same device
TypeError: 'Sigmoid' object is not callable
Experienced dev note
In production, ReLU variants such as LeakyReLU and GELU are often preferred over vanilla ReLU for hidden layers: they let some gradient through for negative inputs, which mitigates the 'dying ReLU' problem, where a neuron whose pre-activation is negative for every input outputs zero and stops learning. Sigmoid is almost never used in hidden layers anymore; it causes vanishing gradients during backprop because its derivative peaks at 0.25, so each sigmoid layer scales the gradient passing through it by at most a quarter. Save Sigmoid for binary classification output layers, and use ReLU or GELU everywhere else. This choice can matter as much as learning rate tuning.
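Both gradient claims can be verified with autograd. This is a small sketch with arbitrarily chosen inputs; the `negative_slope=0.01` passed to LeakyReLU is simply PyTorch's default, written out for clarity.

```python
import torch
import torch.nn as nn

# Sigmoid's gradient: largest at x = 0, tiny once the input saturates
x = torch.tensor([-6.0, 0.0, 6.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
sigmoid_grad = x.grad.clone()

# ReLU on a negative input: the gradient is exactly zero
neg = torch.tensor([-3.0], requires_grad=True)
nn.ReLU()(neg).sum().backward()
relu_grad = neg.grad.clone()

# LeakyReLU on the same input: a small gradient still flows
neg_leaky = torch.tensor([-3.0], requires_grad=True)
nn.LeakyReLU(negative_slope=0.01)(neg_leaky).sum().backward()
leaky_grad = neg_leaky.grad.clone()

print(f"Sigmoid gradients at -6, 0, 6: {sigmoid_grad}")
print(f"ReLU gradient at -3: {relu_grad}")
print(f"LeakyReLU gradient at -3: {leaky_grad}")
```

The sigmoid gradient is 0.25 at zero and drops to roughly 0.0025 at ±6, while the ReLU gradient of zero for a negative input is what lets a unit get stuck; LeakyReLU keeps a slope of 0.01 there instead.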
Check your understanding
Why would applying ReLU to every output in a regression model that predicts house prices be a problem, and what would happen to your predictions?
Answer hint
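A correct answer recognizes that ReLU clamps every negative value to zero, so an output layer followed by ReLU can never predict anything below zero. House prices themselves are not negative, but regression targets are often standardized (and then can be), and in general a regression output should be free to take any real value. The clamp also truncates the prediction range in a harmful way: whenever the pre-activation is negative, the prediction is stuck at exactly 0 and receives zero gradient, so training cannot correct it. Leave the final layer of a regression model linear.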