nn.Module: the base class for all networks
Why this matters
Every neural network you build in PyTorch inherits from nn.Module, so understanding how it works is the foundation for building anything from simple classifiers to transformer models. Without it, you won't know how to structure code that PyTorch can actually train.
Explanation
What it is: nn.Module is PyTorch's base class for all neural network components. When you create a network, you inherit from it and define two things: which layers and parameters your model has (in __init__) and how data flows through them (in forward()).
How it works: When you call model(input), nn.Module's __call__ method runs your forward() method along with any registered hooks, which is why you call the model directly instead of calling forward() yourself. Attributes that are nn.Parameter objects or other nn.Module objects are registered automatically, so they show up in model.parameters() and receive gradients during backprop. The module also handles device placement (CPU/GPU via model.to(...)) and training/evaluation modes (model.train() and model.eval()). Initial weight values are set by the individual layers, such as nn.Linear, when they are constructed.
When to use it: Use nn.Module for anything with learnable weights: linear layers, convolutions, embeddings, attention blocks, or any custom layer you want to train.
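The mechanics described above can be checked directly. The sketch below uses a hypothetical ScaledLinear layer (an illustration, not part of PyTorch) to show that calling the module runs forward(), that both a raw nn.Parameter and a nested nn.Linear are registered, and that train() and eval() flip the training flag on the module and its children.

```python
import torch
import torch.nn as nn


class ScaledLinear(nn.Module):
    """Hypothetical layer: a linear map followed by a learnable scalar scale."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)  # nested module
        self.scale = nn.Parameter(torch.ones(1))             # raw parameter

    def forward(self, x):
        return self.scale * self.linear(x)


layer = ScaledLinear(3, 2)
x = torch.randn(5, 3)

# Calling the module dispatches to forward() (plus any registered hooks).
same = torch.equal(layer(x), layer.forward(x))

# Both the raw parameter and the nested layer's parameters are registered.
param_names = sorted(name for name, _ in layer.named_parameters())

# train() / eval() set the `training` flag on the module and its children.
layer.eval()
eval_flags = (layer.training, layer.linear.training)
layer.train()
train_flags = (layer.training, layer.linear.training)

print(same)
print(param_names)
print(eval_flags, train_flags)
```

Moving the module with layer.to(device) would carry all three parameters along in the same way, because they live in the same registry.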
Analogy
Think of nn.Module like a recipe card. The __init__ method lists your ingredients (parameters like weights and biases). The forward() method is the cooking instructions that say how to combine those ingredients. PyTorch's training loop is the chef who reads the card, executes the recipe, tastes the output (loss), and adjusts the ingredients for next time.
Code
import torch
import torch.nn as nn


class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()  # must run before any layer is assigned
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)   # (batch, input_size) -> (batch, hidden_size)
        x = self.relu(x)
        x = self.fc2(x)   # (batch, hidden_size) -> (batch, output_size)
        return x


model = SimpleNet(input_size=10, hidden_size=32, output_size=2)
input_tensor = torch.randn(4, 10)  # batch of 4 samples, 10 features each
output = model(input_tensor)       # calls forward() through nn.Module.__call__

print(f"Input shape: {input_tensor.shape}")
print(f"Output shape: {output.shape}")
print("\nModel parameters:")
for name, param in model.named_parameters():
    print(f"  {name}: {param.shape}")
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters())}")

Output
Input shape: torch.Size([4, 10])
Output shape: torch.Size([4, 2])

Model parameters:
  fc1.weight: torch.Size([32, 10])
  fc1.bias: torch.Size([32])
  fc2.weight: torch.Size([2, 32])
  fc2.bias: torch.Size([2])

Total parameters: 418
What just happened?
We defined a custom network class that inherits from nn.Module with two linear layers and a ReLU activation. We instantiated it with specific dimensions, created a random batch of 4 samples with 10 features each, passed it through the model via the forward pass, and verified that the output shape matched what we expected (batch size 4, 2 output classes). PyTorch automatically tracked all the weight and bias parameters: we can iterate over them and see they exist in memory ready for gradient computation.
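To see what "ready for gradient computation" means in practice, here is a minimal sketch of a single training step. It rebuilds a network of the same shape with nn.Sequential for brevity; the random data, the SGD optimizer, and the learning rate of 0.1 are illustrative assumptions, not part of the example above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer sizes as SimpleNet, written with nn.Sequential for brevity.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

inputs = torch.randn(4, 10)          # illustrative random batch
targets = torch.randint(0, 2, (4,))  # illustrative random class labels

# The optimizer only sees what model.parameters() yields.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

before = model[0].weight.detach().clone()

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()    # fills .grad on every registered parameter
optimizer.step()   # updates the registered parameters in place

grads_present = all(p.grad is not None for p in model.parameters())
weights_changed = not torch.equal(before, model[0].weight.detach())
print(grads_present, weights_changed)
```

Anything that is not returned by model.parameters() never reaches the optimizer, which is why registration matters so much.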
Common gotcha
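A common mistake is forgetting to call super().__init__() at the start of your __init__ method. nn.Module.__init__ creates the internal registries that hold parameters and submodules, so without it the first layer assignment fails with AttributeError: cannot assign module before Module.__init__() call. That failure is loud. The silent variant is storing a weight as a plain torch.Tensor instead of an nn.Parameter: it exists on your object but won't show up in model.parameters(), won't move to GPU with model.to('cuda'), and won't get updated during training, all without an error message.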
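A quick way to see how these mistakes behave is to try them. The sketch below uses hypothetical class names and assumes a reasonably recent PyTorch release, where assigning a layer before nn.Module.__init__ has run raises an AttributeError. It also shows a plain tensor attribute being left out of the parameter list without any error.

```python
import torch
import torch.nn as nn


class NoSuperNet(nn.Module):
    """Hypothetical broken module: super().__init__() is missing."""

    def __init__(self):
        self.fc1 = nn.Linear(10, 2)  # fails: internal registries do not exist yet


class RawTensorNet(nn.Module):
    """Hypothetical module that stores one weight as a plain tensor."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 2)
        self.raw_scale = torch.ones(1, requires_grad=True)  # NOT registered
        self.scale = nn.Parameter(torch.ones(1))             # registered


try:
    NoSuperNet()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

names = [name for name, _ in RawTensorNet().named_parameters()]

print(error_message)
print(names)  # raw_scale is absent, scale is present
```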
Error recovery
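AttributeError: 'YourNet' object has no attribute 'fc1'
Usually means forward() uses a layer that was never assigned in __init__. Check for a missing assignment or a misspelled attribute name.
RuntimeError: Expected all tensors to be on the same device
Usually means the model and its input live on different devices. Move both to the same device, for example with model.to(device) and input.to(device).
TypeError: forward() missing 1 required positional argument: 'x'
Usually means the model was called without an input, as in model() instead of model(input_tensor).
Experienced dev note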
A subtle thing: when you inherit from nn.Module and define child modules (like self.fc1 = nn.Linear(...)), registration happens because nn.Module overrides __setattr__. Each time you assign an attribute, it checks whether the value is an nn.Parameter or an nn.Module and, if so, records it in an internal registry. That check only looks at the assigned value itself, not inside containers. If you build a plain Python list like self.layers = [nn.Linear(...), nn.Linear(...)], the list is stored as an ordinary attribute and the parameters of the layers inside it won't be tracked: use nn.ModuleList instead. Similarly, for dictionaries use nn.ModuleDict. This catches even experienced developers when they refactor code for flexibility.
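The difference is easy to confirm by counting parameters. This is a minimal sketch with hypothetical class names; the layer sizes are arbitrary.

```python
import torch.nn as nn


class PlainListNet(nn.Module):
    """Hypothetical module that keeps its layers in a plain Python list."""

    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 4)]  # not registered

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class ModuleListNet(nn.Module):
    """Same architecture, but the layers are held in an nn.ModuleList."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 4)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def count_params(module):
    # Counts only what the module has registered.
    return sum(p.numel() for p in module.parameters())


plain_count = count_params(PlainListNet())
tracked_count = count_params(ModuleListNet())
print(plain_count, tracked_count)  # 0 40
```

Both versions produce outputs in a forward pass, so the plain-list version looks fine until an optimizer built from model.parameters() turns out to have nothing to update.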
Check your understanding
If you moved your model to GPU with model.to('cuda'), but accidentally created a new parameter inside the forward() method (like a learnable scale factor initialized fresh in forward), would that parameter be on GPU or CPU, and why would that break training?
Answer hint
A correct answer recognizes that a parameter created in forward() would be on CPU (the default device) while your input and other parameters are on GPU, causing a device mismatch error. More importantly, it identifies that the parameter is rebuilt from scratch on every call and is never registered, so the optimizer never sees it and nothing is learned. Parameters must be created in __init__ so they're registered and tracked properly.
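The registration half of the answer can be checked without a GPU. The sketch below uses hypothetical class names and runs on CPU, where the broken version raises no error at all, which is part of why the mistake is easy to miss.

```python
import torch
import torch.nn as nn


class ScaleInForward(nn.Module):
    """Hypothetical anti-pattern: the scale is rebuilt on every call."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 3)

    def forward(self, x):
        # New CPU tensor on each call; never registered with the module.
        scale = nn.Parameter(torch.ones(1))
        return scale * self.fc(x)


class ScaleInInit(nn.Module):
    """Fixed version: the scale is created once in __init__ and registered."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 3)
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.scale * self.fc(x)


bad, good = ScaleInForward(), ScaleInInit()
x = torch.randn(2, 3)
bad_out = bad(x)    # runs on CPU without complaint
good_out = good(x)

bad_names = [name for name, _ in bad.named_parameters()]
good_names = [name for name, _ in good.named_parameters()]
print(bad_names)    # no scale entry
print(good_names)   # includes scale
```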