nn.Conv2d: parameters explained
Why this matters
Conv2d is the core building block of the convolutional vision models you'll build. Misunderstanding its parameters leads to shape mismatches, wrong receptive fields, and silent performance bugs in production. To debug real networks, you need to know exactly what in_channels, out_channels, kernel_size, stride, and padding do.
Explanation
What it is: nn.Conv2d applies a sliding window (filter/kernel) of learnable weights across a 2D spatial input. At each position the filter multiplies its weights element-wise with the input patch and sums the result, producing one value of a new feature map. Multiple filters are stacked to produce multiple output channels.
How it works mechanically: Given input shape (batch, in_channels, height, width), Conv2d creates out_channels independent filters, each with shape (in_channels, kernel_size[0], kernel_size[1]). Each filter slides across the spatial dimensions using stride (step size) and padding (zeros added around the edges). At each position, the convolution produces one output value per filter. The output shape is (batch, out_channels, new_height, new_width), where the new dimensions depend on kernel_size, stride, and padding.
When to use it: Use Conv2d as the primary building block in CNNs for images. Stack multiple Conv2d layers to increase the receptive field and extract hierarchical features. Reduce spatial dimensions using stride > 1 or max pooling between Conv2d layers.
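The sliding-window mechanics described above can be sketched as an explicit loop and compared against PyTorch's own torch.nn.functional.conv2d. This is a minimal illustration, not how PyTorch implements convolution internally; the helper name naive_conv2d and the tensor sizes are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F


def naive_conv2d(x, weight, bias, stride=1, padding=0):
    """Loop-based reference for a 2D convolution (hypothetical helper)."""
    batch, in_ch, h, w = x.shape
    out_ch, _, kh, kw = weight.shape
    # Zero-pad the two spatial dimensions on every side.
    x = F.pad(x, (padding, padding, padding, padding))
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1
    out = torch.zeros(batch, out_ch, out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Patch under the window: (batch, in_channels, kh, kw).
            window = x[:, :, top:top + kh, left:left + kw]
            # Multiply element-wise with each filter and sum over
            # channels and kernel positions: one value per filter.
            out[:, :, i, j] = torch.einsum("bchw,ochw->bo", window, weight) + bias
    return out


torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)        # (batch, in_channels, height, width)
weight = torch.randn(4, 3, 3, 3)   # (out_channels, in_channels, kh, kw)
bias = torch.randn(4)

manual = naive_conv2d(x, weight, bias, stride=2, padding=1)
reference = F.conv2d(x, weight, bias, stride=2, padding=1)
print(manual.shape)
print(torch.allclose(manual, reference, atol=1e-4))
```

The loop is far slower than the built-in operator; its only purpose is to make the roles of stride, padding, and the per-filter sum visible.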
Analogy
Think of Conv2d as a scanning window with glasses on. Each pair of glasses (filter) looks for a different pattern (edge, corner, texture). As you slide the glasses across a photo, they get excited (high activation) when they see their pattern. More glasses (out_channels) means more patterns you can detect. Bigger glasses (larger kernel_size) see larger areas. Walking in bigger steps (stride > 1) means fewer positions to check.
Code
import torch
import torch.nn as nn
# Layer 1: 3 input channels (e.g. RGB), 16 filters, 3x3 kernel.
# padding=1 keeps height and width unchanged at stride=1.
conv_layer = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=1,
    padding=1
)
print(f"Conv2d layer: {conv_layer}")
print(f"\nWeight shape: {conv_layer.weight.shape}")
print(f"Bias shape: {conv_layer.bias.shape}")
print(f"Number of parameters: {sum(p.numel() for p in conv_layer.parameters())}")
# A batch of 2 random 32x32 images with 3 channels each.
batch_size = 2
input_tensor = torch.randn(batch_size, 3, 32, 32)
print(f"\nInput shape: {input_tensor.shape}")
output = conv_layer(input_tensor)
print(f"Output shape: {output.shape}")
print(f"Output dtype: {output.dtype}")
# Layer 2: same as layer 1 but with stride=2, which halves height and width.
conv_stride2 = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1
)
output_stride2 = conv_stride2(input_tensor)
print(f"\nWith stride=2, output shape: {output_stride2.shape}")
# Layer 3: no padding, so a 3x3 kernel trims 2 pixels from each dimension.
conv_no_padding = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=1,
    padding=0
)
output_no_pad = conv_no_padding(input_tensor)
print(f"With padding=0, output shape: {output_no_pad.shape}")
Output
Conv2d layer: Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
Weight shape: torch.Size([16, 3, 3, 3])
Bias shape: torch.Size([16])
Number of parameters: 448
Input shape: torch.Size([2, 3, 32, 32])
Output shape: torch.Size([2, 16, 32, 32])
Output dtype: torch.float32
With stride=2, output shape: torch.Size([2, 16, 16, 16])
With padding=0, output shape: torch.Size([2, 16, 30, 30])
What just happened?
We created three Conv2d layers with different configurations and traced how they transform input shapes. The first layer kept the spatial dimensions (padding=1 compensates for the 3×3 kernel at stride=1). The second layer halved them (stride=2). The third layer shrank them (with padding=0, a 3×3 kernel loses 2 pixels per dimension). The weight tensor shape [16, 3, 3, 3] shows 16 output filters, each spanning 3 input channels with a 3×3 kernel. Each layer ran one forward pass on an input of shape (batch_size=2, channels=3, height=32, width=32), and every output shape matches floor((input_size - kernel_size + 2*padding) / stride + 1).
Common gotcha
padding=0 is the default, so a layer created without an explicit padding argument shrinks its output, and developers who expect the spatial size to be preserved get surprise shape mismatches. For a 3×3 kernel with stride=1 and no padding, height shrinks by 2 (same for width). Use the formula: output_size = floor((input_size - kernel_size + 2*padding) / stride + 1). Many devs also forget that the weight shape is [out_channels, in_channels, kernel_h, kernel_w], NOT [in_channels, out_channels, ...]. This matters when debugging layer connections.
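As a quick check of the output-size formula, here is a small pure-Python sketch applied to the three layer configurations from the Code section. The function name conv_output_size is a hypothetical helper, not part of PyTorch, and it handles one spatial dimension at a time.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """floor((input_size - kernel_size + 2*padding) / stride + 1) for one dimension."""
    # Integer floor division implements the floor for non-negative sizes.
    return (input_size - kernel_size + 2 * padding) // stride + 1


# The three configurations from the Code section, on a 32-pixel dimension.
preserved = conv_output_size(32, kernel_size=3, stride=1, padding=1)
halved = conv_output_size(32, kernel_size=3, stride=2, padding=1)
trimmed = conv_output_size(32, kernel_size=3, stride=1, padding=0)

print(preserved, halved, trimmed)  # 32 16 30
```

Running the helper on a planned stack of layers before building the model is a cheap way to catch a shape mismatch early.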
Error recovery
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d
RuntimeError: Expected in_channels=3, but got in_channels=1
Shape mismatch in subsequent layer
Experienced dev note
In practice, when you want to preserve spatial dimensions at stride=1, use padding='same' or, equivalently for odd kernels, padding=(kernel_size - 1) // 2. Note that padding='same' is not supported for strided convolutions, so be explicit wherever you downsample. Many models use Conv2d with stride=2 instead of a separate pooling layer for downsampling. Also, memorize the parameter count formula: (in_channels * kernel_h * kernel_w + 1) * out_channels, where the '+1' is the bias. You'll calculate this constantly when designing architectures and estimating memory usage. Finally, Conv2d is rarely used alone: it is usually part of a block such as Conv2d → BatchNorm → ReLU.
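The points in this note can be sketched together in a few lines. The helper names conv_param_count and conv_bn_relu are hypothetical, introduced only for illustration, and the example assumes a PyTorch version recent enough to accept the string padding='same'.

```python
import torch
import torch.nn as nn


def conv_param_count(in_channels, out_channels, kernel_h, kernel_w):
    """(in_channels * kernel_h * kernel_w + 1) * out_channels; the +1 is the bias."""
    return (in_channels * kernel_h * kernel_w + 1) * out_channels


def conv_bn_relu(in_channels, out_channels, kernel_size=3, stride=1):
    """Hypothetical helper: a Conv2d -> BatchNorm2d -> ReLU block."""
    # For odd kernels this padding preserves height and width at stride=1.
    padding = (kernel_size - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size,
                  stride=stride, padding=padding),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
    )


x = torch.randn(2, 3, 32, 32)

block = conv_bn_relu(3, 16, kernel_size=3, stride=1)        # keeps 32x32
down_block = conv_bn_relu(3, 16, kernel_size=3, stride=2)   # downsamples to 16x16
out = block(x)
down = down_block(x)

# Compare the formula with the parameters PyTorch actually allocates
# for the convolution inside the block.
counted = sum(p.numel() for p in block[0].parameters())
print(counted, conv_param_count(3, 16, 3, 3))

# padding='same' and the explicit (kernel_size - 1) // 2 give the same shape.
same_out = nn.Conv2d(3, 16, kernel_size=5, padding="same")(x)
explicit_out = nn.Conv2d(3, 16, kernel_size=5, padding=2)(x)
print(out.shape, down.shape, same_out.shape, explicit_out.shape)
```

The stride=2 block stands in for a pooling layer here; whether that is preferable to max pooling depends on the architecture.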
Check your understanding
You have a Conv2d layer with in_channels=3, out_channels=64, kernel_size=5, stride=1, padding=2, and input shape (batch=8, channels=3, height=28, width=28). What is the exact output shape, and how many learnable parameters does this layer have? Why does padding=2 matter here given the kernel_size=5?
Show answer hint
Output shape uses the formula with padding and stride. Calculate new_height = floor((28 - 5 + 2*2) / 1 + 1). Parameter count is (in_channels * kernel_h * kernel_w + 1) * out_channels. Padding=2 preserves input spatial dimensions (without it, height would shrink by 4), which is critical for residual connections and maintaining feature map size through the network.