Code Beginner easy · 5 min

nn.ReLU, nn.Sigmoid: activation functions

What you will learn

Activation functions introduce non-linearity to neural networks by applying mathematical transformations to neuron outputs.

Why this matters

Without activation functions, stacking layers in a neural network collapses into a single linear transformation, making deep networks useless. Choosing the right activation directly impacts training speed, convergence, and model accuracy.
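The collapse described above can be checked numerically. The following is a minimal sketch, assuming arbitrary layer sizes (4, 8, 2) and a fixed seed chosen only for illustration: it folds two stacked nn.Linear layers into one and compares the outputs, then repeats the comparison with a ReLU in between.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the comparison is repeatable

# Two stacked linear layers with no activation in between
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 2)
stacked = nn.Sequential(layer1, layer2)

with torch.no_grad():
    # Fold both layers into one: W = W2 @ W1, b = W2 @ b1 + b2
    merged = nn.Linear(4, 2)
    merged.weight.copy_(layer2.weight @ layer1.weight)
    merged.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

    x = torch.randn(3, 4)
    same_without_activation = torch.allclose(stacked(x), merged(x), atol=1e-5)

    # Insert a ReLU between the same two layers and compare again
    with_relu = nn.Sequential(layer1, nn.ReLU(), layer2)
    same_with_activation = torch.allclose(with_relu(x), merged(x), atol=1e-5)

print(f"Stacked linear layers equal one linear layer: {same_without_activation}")
print(f"Still equal once ReLU is inserted: {same_with_activation}")
```

Without the activation, the two layers are exactly one affine map, so extra depth adds no expressive power. The ReLU breaks that equivalence.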

Skip if: You don't need an activation function on the final output layer of a regression model predicting continuous values. You also shouldn't add one between the model and a loss function that expects raw logits: PyTorch losses like CrossEntropyLoss and BCEWithLogitsLoss apply the softmax or sigmoid internally.

Explanation

Activation functions are non-linear transformations applied to the output of neurons. Loosely, they decide whether a neuron should 'fire' or remain dormant, and this non-linearity is what allows neural networks to learn complex patterns. ReLU (Rectified Linear Unit) returns max(0, x): it passes positive values unchanged and zeros out negative ones. Sigmoid squashes its input into the range (0, 1) using the formula 1/(1+e^-x), which is useful for binary classification probabilities. Mechanically, a layer like nn.ReLU() is a module with no learnable parameters; when it is called in the forward pass, it applies the function to every element of the tensor independently and returns a tensor of the same shape. ReLU is the usual default for hidden layers because it is computationally cheap and its gradient is exactly 1 for positive inputs, which mitigates (but does not eliminate) vanishing gradients. Sigmoid was historically preferred but saturates: its gradient approaches zero for large positive or negative inputs, so it is better suited to the output layer of a binary classifier.
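The two formulas can be written out by hand and compared against the built-in layers. This is a small sketch for illustration only; the input values are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Hand-written versions of the two formulas, applied element-wise
manual_relu = torch.clamp(x, min=0)        # max(0, x)
manual_sigmoid = 1 / (1 + torch.exp(-x))   # 1 / (1 + e^-x)

# Compare against the nn modules
relu_matches = torch.allclose(manual_relu, nn.ReLU()(x))
sigmoid_matches = torch.allclose(manual_sigmoid, nn.Sigmoid()(x))

print(f"ReLU matches max(0, x): {relu_matches}")
print(f"Sigmoid matches 1/(1+e^-x): {sigmoid_matches}")
```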

Analogy

Think of ReLU as a gating mechanism: it's either fully open (passes signal) or fully closed (blocks signal). Sigmoid is like a dimmer switch that gradually transitions from off to on, useful when you need a probability that smoothly ranges from impossible to certain.

Code

python

import torch
import torch.nn as nn

# Activations are modules: instantiate once, then call like a function
relu = nn.ReLU()
sigmoid = nn.Sigmoid()

input_tensor = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Both are applied element-wise and preserve the input shape
relu_output = relu(input_tensor)
sigmoid_output = sigmoid(input_tensor)

print(f"Input: {input_tensor}")
print(f"ReLU output: {relu_output}")
print(f"Sigmoid output: {sigmoid_output}")

# Small binary classifier: ReLU in the hidden layer, Sigmoid on the output
model = nn.Sequential(
    nn.Linear(5, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.Sigmoid()
)

# Batch of 2 random samples with 5 features each
sample_input = torch.randn(2, 5)
output = model(sample_input)
print(f"\nModel output shape: {output.shape}")
print(f"Model output (probabilities): {output}")

Output

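Input: tensor([-2.0000, -0.5000,  0.0000,  0.5000,  2.0000])
ReLU output: tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
Sigmoid output: tensor([0.1192, 0.3775, 0.5000, 0.6225, 0.8808])

Model output shape: torch.Size([2, 1])
Model output (probabilities): tensor([[0.5234],
        [0.4891]], grad_fn=<SigmoidBackward0>)

The two model output values depend on randomly initialized weights and random inputs, so yours will differ from run to run.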

What just happened?

The code instantiated the ReLU and Sigmoid activation functions as layers. When passed a tensor of 5 values, ReLU zeroed out the negatives and kept the positives unchanged, while Sigmoid compressed all values into the (0, 1) range. Then a sequential model with two linear layers was built as Linear → ReLU → Linear → Sigmoid, which transformed a batch of 2 samples from 5 dimensions down to 1 dimension, with sigmoid-squashed outputs between 0 and 1.

Common gotcha

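The most common mistake is pairing an activation function with a loss that already applies it. BCEWithLogitsLoss applies sigmoid internally and expects raw logits, while BCELoss expects probabilities that have already been passed through sigmoid. If your model ends in nn.Sigmoid(), or you write `loss = criterion(sigmoid(model_output), targets)`, and criterion is BCEWithLogitsLoss, sigmoid is applied twice and training silently degrades. The same applies to CrossEntropyLoss, which applies log-softmax internally. Always check your loss function's documentation: some expect raw logits, some expect already-activated outputs.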
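To make the pairing concrete, here is a minimal sketch with made-up logits and targets (the numbers are illustrative assumptions, not from a real model). It compares the two correct pairings with the double-sigmoid mistake.

```python
import torch
import torch.nn as nn

# Hypothetical raw model outputs (no sigmoid applied) and binary targets
logits = torch.tensor([[1.5], [-0.8], [0.3]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

# Correct pairing 1: raw logits + BCEWithLogitsLoss (sigmoid applied inside the loss)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Correct pairing 2: probabilities + BCELoss (sigmoid applied before the loss)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Mistake: sigmoid applied before a loss that applies sigmoid again
loss_double_sigmoid = nn.BCEWithLogitsLoss()(torch.sigmoid(logits), targets)

print(f"Logits + BCEWithLogitsLoss:  {loss_from_logits.item():.4f}")
print(f"Sigmoid + BCELoss:           {loss_from_probs.item():.4f}")
print(f"Sigmoid + BCEWithLogitsLoss: {loss_double_sigmoid.item():.4f}")
```

The first two losses agree up to floating-point error; the third differs, and no error is raised to warn you. BCEWithLogitsLoss is generally the numerically safer of the two correct options.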

Error recovery

RuntimeError: Expected all tensors to be on the same device

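Activation layers such as nn.ReLU have no parameters, so this error usually comes from a parameterized layer (for example nn.Linear) whose weights are on the GPU while the input tensor is still on the CPU, or the reverse. Put both on the same device with model.to(device) and input_tensor = input_tensor.to(device). Note that calling .to() on a tensor returns a new tensor rather than moving it in place, so reassign the result.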
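A device-agnostic pattern along these lines avoids the mismatch. This is a sketch that assumes a single optional CUDA GPU and falls back to CPU when none is available; the model is the same small network used earlier.

```python
import torch
import torch.nn as nn

# Pick the GPU if one is available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Linear(5, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.Sigmoid(),
).to(device)  # for modules, .to() moves the parameters in place

batch = torch.randn(2, 5)
batch = batch.to(device)  # for tensors, .to() returns a new tensor: reassign it

output = model(batch)
print(f"Model and input are on: {output.device}")
```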

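TypeError: Sigmoid.__init__() takes 1 positional argument but 2 were given

This happens if you forgot the parentheses: use nn.Sigmoid() not nn.Sigmoid. The parentheses instantiate the layer; without them you have the class, so calling it on a tensor passes the tensor to the constructor instead of applying the activation. The exact wording can vary by PyTorch version. Passing the bare class to nn.Sequential fails in a similar way, with a TypeError saying it is not a Module subclass.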
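The class-versus-instance difference can be seen in a short sketch. The layer sizes here are arbitrary, and the error text is captured rather than hard-coded because its wording may differ between PyTorch versions.

```python
import torch
import torch.nn as nn

x = torch.tensor([0.0])

# Correct: instantiate the layer, then call the instance on a tensor
activation = nn.Sigmoid()
result = activation(x)
print(f"Instance is a module: {isinstance(activation, nn.Module)}")
print(f"Bare class is a module: {isinstance(nn.Sigmoid, nn.Module)}")

# Mistake: passing the class itself (no parentheses) to nn.Sequential
captured_errors = []
try:
    nn.Sequential(nn.Linear(1, 1), nn.Sigmoid)
except TypeError as exc:
    captured_errors.append(str(exc))

print(f"Raised TypeError: {len(captured_errors) == 1}")
```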

Experienced dev note

In production, ReLU variants such as LeakyReLU and GELU are often preferred over vanilla ReLU for hidden layers. They address the 'dying ReLU' problem, where a neuron whose pre-activation is negative for every input outputs zero, receives zero gradient, and so never recovers. Sigmoid is rarely used in hidden layers anymore: its derivative peaks at 0.25 (at x = 0), so gradients shrink with every sigmoid layer they pass through during backprop. Save Sigmoid for binary classification output layers, and use ReLU or one of its variants in hidden layers. On deep networks, this choice can matter as much as careful learning rate tuning.
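Both claims can be checked with autograd. The following is a minimal sketch; the input grid and the LeakyReLU slope of 0.01 (PyTorch's default) are illustrative choices.

```python
import torch
import torch.nn as nn

# 1) Sigmoid's derivative over a grid of inputs: it peaks near x = 0
grid = torch.linspace(-6.0, 6.0, steps=121, requires_grad=True)
torch.sigmoid(grid).sum().backward()
max_sigmoid_grad = grid.grad.max().item()
print(f"Largest sigmoid gradient on the grid: {max_sigmoid_grad:.4f}")

# 2) Gradients for negative inputs: ReLU passes none, the variants pass some
def grad_at(activation: nn.Module, values: list) -> torch.Tensor:
    """Return d(activation(x))/dx for each value, computed by autograd."""
    x = torch.tensor(values, requires_grad=True)
    activation(x).sum().backward()
    return x.grad

negative_inputs = [-3.0, -1.0]
relu_grad = grad_at(nn.ReLU(), negative_inputs)
leaky_grad = grad_at(nn.LeakyReLU(0.01), negative_inputs)
gelu_grad = grad_at(nn.GELU(), negative_inputs)

print(f"ReLU gradient:      {relu_grad}")
print(f"LeakyReLU gradient: {leaky_grad}")
print(f"GELU gradient:      {gelu_grad}")
```

ReLU's gradient is exactly zero for negative inputs, which is why a neuron stuck there stops learning. LeakyReLU and GELU keep a non-zero gradient at these points.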

Check your understanding

Why would applying ReLU to every output in a regression model that predicts house prices be a problem, and what would happen to your predictions?

Answer hint

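A correct answer recognizes that ReLU clamps every negative value to zero, so the model can never output anything below zero. That truncates the prediction range for any regression whose targets can be negative, which includes house prices once they have been standardized to zero mean. Even when the targets are strictly positive, an output whose pre-activation is negative predicts exactly 0 and receives zero gradient, so it may never recover. Leave the final layer of a regression model linear.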
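The effect can be sketched with a few hypothetical raw outputs from a final linear layer (the values below are made up, for example predictions for standardized prices):

```python
import torch
import torch.nn as nn

# Hypothetical raw outputs of the last Linear layer
raw = torch.tensor([-1.2, 0.4, 2.5], requires_grad=True)

# Applying ReLU on the output clamps the negative prediction to zero
clipped = nn.ReLU()(raw)
clipped.sum().backward()

print(f"Raw predictions:     {raw.detach()}")
print(f"After ReLU:          {clipped.detach()}")
print(f"Gradient w.r.t. raw: {raw.grad}")  # zero where the raw value was negative
```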

VERSION No breaking changes between PyTorch 2.6.x and 2.11.x for activation functions. The nn.ReLU and nn.Sigmoid APIs are stable.

Next, explore nn.Linear layers to understand how activations connect to dense transformations, or learn about weight initialization strategies that work best with different activation functions.
