Code Intermediate medium · 6 min

nn.Conv2d: parameters explained

What you will learn

Conv2d slides a learnable filter across an image to extract spatial features, and its parameters control filter size, count, stride, and padding behavior.

Why this matters

Conv2d is the core building block of most convolutional vision models you'll build. Misunderstanding its parameters leads to shape mismatches, wrong receptive fields, and silent performance bugs in production. To debug real networks, you need to know exactly what in_channels, out_channels, kernel_size, stride, and padding do.

Skip if: Your data is 1D sequential (use Conv1d) or 3D volumetric (use Conv3d). Conv2d is also the wrong tool when you need a fully connected layer, because it assumes spatial structure, and it should not be your first layer if your input isn't an image-like tensor with shape (batch, channels, height, width).

Explanation

What it is: nn.Conv2d applies a sliding window (filter/kernel) of learnable weights across a 2D spatial input. The filter performs element-wise multiplication and summation at each position, producing a new feature map. Multiple filters are stacked to produce multiple output channels.

How it works mechanically: Given input shape (batch, in_channels, height, width), Conv2d creates out_channels independent filters, each with shape (in_channels, kernel_size[0], kernel_size[1]). The filter slides across the spatial dimensions using stride (step size) and padding (zeros added around edges). At each position, the convolution produces one output value per filter. The output shape is (batch, out_channels, new_height, new_width), where the new dimensions depend on kernel size, stride, and padding.

When to use it: Use Conv2d as the primary building block in CNNs for images. Stack multiple Conv2d layers to increase receptive field and extract hierarchical features. Reduce spatial dimensions using stride > 1 or max pooling between Conv2d layers.
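To make "element-wise multiplication and summation at each position" concrete, the following sketch recomputes a single output value by hand and compares it with what the layer produces. The layer sizes, the seed, and the chosen filter index and position are arbitrary assumptions picked for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random weights and input

# Small layer with no padding so output position (i, j) maps to the
# input window starting at (i, j) when stride=1.
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, stride=1, padding=0)
x = torch.randn(1, 3, 8, 8)  # (batch, in_channels, height, width)
out = conv(x)                # spatial size shrinks from 8 to 6

# Recompute the value of filter f at output position (i, j) by hand.
f, i, j = 2, 1, 4
patch = x[0, :, i:i + 3, j:j + 3]  # 3x3 window across all 3 input channels
manual = (patch * conv.weight[f]).sum() + conv.bias[f]

print(out.shape)
print(torch.allclose(manual, out[0, f, i, j], atol=1e-5))
```

Each filter spans every input channel, which is why the patch has shape (3, 3, 3) rather than (3, 3).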

Analogy

Think of Conv2d as a scanning window with glasses on. Each pair of glasses (filter) looks for a different pattern (edge, corner, texture). As you slide the glasses across a photo, they get excited (high activation) when they see their pattern. More glasses (out_channels) means more patterns you can detect. Bigger glasses (larger kernel_size) see larger areas. Walking in bigger steps (stride > 1) means fewer positions to check.

Code

python

import torch
import torch.nn as nn

# Baseline: a 3x3 kernel with stride=1 and padding=1 preserves height and width.
conv_layer = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=16,  # number of filters, and therefore output channels
    kernel_size=3,
    stride=1,
    padding=1
)

print(f"Conv2d layer: {conv_layer}")
print(f"\nWeight shape: {conv_layer.weight.shape}")
print(f"Bias shape: {conv_layer.bias.shape}")
print(f"Number of parameters: {sum(p.numel() for p in conv_layer.parameters())}")

# A batch of two random 32x32 three-channel inputs: (batch, channels, height, width).
batch_size = 2
input_tensor = torch.randn(batch_size, 3, 32, 32)
print(f"\nInput shape: {input_tensor.shape}")

output = conv_layer(input_tensor)
print(f"Output shape: {output.shape}")
print(f"Output dtype: {output.dtype}")

# Same layer with stride=2: each spatial dimension is halved.
conv_stride2 = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1
)
output_stride2 = conv_stride2(input_tensor)
print(f"\nWith stride=2, output shape: {output_stride2.shape}")

# Same layer with padding=0: each spatial dimension shrinks by kernel_size - 1 = 2.
conv_no_padding = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=1,
    padding=0
)
output_no_pad = conv_no_padding(input_tensor)
print(f"With padding=0, output shape: {output_no_pad.shape}")

Output

Conv2d layer: Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

Weight shape: torch.Size([16, 3, 3, 3])
Bias shape: torch.Size([16])
Number of parameters: 448

Input shape: torch.Size([2, 3, 32, 32])
Output shape: torch.Size([2, 16, 32, 32])
Output dtype: torch.float32

With stride=2, output shape: torch.Size([2, 16, 16, 16])
With padding=0, output shape: torch.Size([2, 16, 30, 30])

What just happened?

We created three Conv2d layers with different configurations and traced how they transform input shapes. The first layer kept spatial dimensions (padding=1 compensates for the 3×3 kernel). The second layer halved spatial dimensions (stride=2). The third layer shrank each dimension by 2 (with padding=0, a 3×3 kernel loses one pixel on each edge). The weight tensor shape shows 16 output filters, each spanning 3 input channels with a 3×3 kernel. We ran a single forward pass on an input of shape (batch_size=2, channels=3, height=32, width=32) and confirmed that the output shapes match the output-size formula given in the next section.

Common gotcha

Developers often forget that padding=0 is the default and get surprise shape mismatches. For a 3×3 kernel with stride=1 and no padding, height shrinks by 2 (same for width). Use the formula (valid for the default dilation=1): output_size = floor((input_size - kernel_size + 2*padding) / stride + 1). Many devs also forget that weight shape is [out_channels, in_channels, kernel_h, kernel_w], NOT [in_channels, out_channels, ...]. This matters when debugging layer connections.
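The formula is easy to check against real layers. The sketch below wraps it in a small helper (conv_output_size is a hypothetical name, not part of PyTorch, and it assumes the default dilation=1) and compares its prediction with the width that nn.Conv2d actually produces for the three configurations used in the Code section.

```python
import math
import torch
import torch.nn as nn

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Hypothetical helper: output length along one spatial dimension (dilation=1)."""
    return math.floor((input_size - kernel_size + 2 * padding) / stride + 1)

x = torch.randn(2, 3, 32, 32)
results = []
for stride, padding in [(1, 1), (2, 1), (1, 0)]:
    layer = nn.Conv2d(3, 16, kernel_size=3, stride=stride, padding=padding)
    predicted = conv_output_size(32, 3, stride, padding)
    actual = layer(x).shape[-1]  # width of the real output
    results.append((predicted, actual))
    print(f"stride={stride}, padding={padding}: predicted={predicted}, actual={actual}")

# Weight layout is [out_channels, in_channels, kernel_h, kernel_w].
print(tuple(layer.weight.shape))
```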

Error recovery

RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d

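The input has the wrong number of dimensions. Conv2d expects (batch, in_channels, height, width), or (in_channels, height, width) for a single unbatched image. This error typically means a dimension is missing, for example a grayscale image of shape (height, width). Add the missing dimensions with input.unsqueeze(0) or reshape to (1, channels, h, w).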
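A minimal way to reproduce and fix this, assuming a single-channel 28×28 image stored as a 2D tensor (the sizes and variable names here are illustrative, not taken from any particular dataset):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
gray = torch.randn(28, 28)  # (height, width): no channel or batch dimension

raised = False
try:
    conv(gray)  # 2D input is rejected
except RuntimeError as err:
    raised = True
    print(f"Failed as expected: {err}")

# Add the channel dimension, then the batch dimension.
batched = gray.unsqueeze(0).unsqueeze(0)  # (1, 1, 28, 28)
fixed_out = conv(batched)
print(fixed_out.shape)
```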

RuntimeError: Given groups=1, weight of size [16, 3, 3, 3], expected input[2, 1, 32, 32] to have 3 channels, but got 1 channels instead

Your input tensor has the wrong number of channels. If the model expects 3 channels and you are loading RGB images, verify that the image loader didn't convert them to grayscale. Use img.convert('RGB') in PIL or check the flag passed to cv2.imread.
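When the data is already a tensor, checking the channel dimension against the layer's in_channels attribute before the forward pass gives a clearer failure than the RuntimeError. The sketch below also repeats the single gray channel three times; that workaround is an assumption for illustration and is not a substitute for loading the image correctly.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
gray_batch = torch.randn(2, 1, 32, 32)  # one channel, but the layer expects three

# Compare the channel dimension (dim 1) with what the layer was built for.
mismatch = gray_batch.shape[1] != conv.in_channels
if mismatch:
    print(f"channel mismatch: got {gray_batch.shape[1]}, layer expects {conv.in_channels}")

# Assumption: duplicating the gray channel is acceptable for this experiment.
rgb_like = gray_batch.repeat(1, 3, 1, 1)  # (2, 3, 32, 32)
print(conv(rgb_like).shape)
```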

Shape mismatch in subsequent layer

Output spatial dimensions don't match what the next layer expects. Use the formula: output_h = floor((input_h - kernel_size + 2*padding) / stride + 1). Double-check stride and padding values. A common cause: stride=1 with padding=0 shrinks each dimension by kernel_size - 1 per layer, which adds up quickly in a deep stack.
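One practical way to find where the shapes diverge is to push a dummy tensor through the layers one at a time and record each shape. The three-layer stack below is a made-up example, not a recommended architecture.

```python
import torch
import torch.nn as nn

# Hypothetical stack mixing unpadded and strided convolutions.
layers = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=0),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=0),
)

x = torch.randn(1, 3, 32, 32)
shapes = [tuple(x.shape)]
for layer in layers:
    x = layer(x)                  # apply one layer at a time
    shapes.append(tuple(x.shape)) # record the shape after each step

for step, shape in enumerate(shapes):
    print(step, shape)
```

The spatial size goes 32 → 30 → 15 → 13, so any later layer sized for a 16×16 feature map would fail.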

Experienced dev note

In practice, when you want to preserve spatial dimensions at stride=1, use padding='same' or the equivalent explicit value (kernel_size - 1) // 2 for odd kernels, and be explicit about where you downsample. Note that padding='same' only supports stride=1. Many models use Conv2d with stride=2 instead of separate pooling layers for downsampling; this is a common choice in modern architectures. Also, memorize the parameter count formula: (in_channels * kernel_h * kernel_w + 1) * out_channels. You'll calculate this constantly when designing architectures and estimating memory usage. The '+1' is the bias. Finally, Conv2d is rarely used alone: it usually sits inside a block (Conv2d → BatchNorm → ReLU).
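The parameter count formula and the Conv2d → BatchNorm → ReLU pattern can both be checked in a few lines. In this sketch conv2d_param_count is a hypothetical helper that assumes the default groups=1, and the block sizes are arbitrary.

```python
import torch
import torch.nn as nn

def conv2d_param_count(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Hypothetical helper: (in_channels * kernel_h * kernel_w + 1) * out_channels."""
    return (in_channels * kernel_h * kernel_w + (1 if bias else 0)) * out_channels

# padding="same" keeps height and width unchanged (stride must be 1).
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding="same"),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

conv = block[0]
actual = sum(p.numel() for p in conv.parameters())
print(conv2d_param_count(3, 16, 3, 3), actual)

x = torch.randn(2, 3, 32, 32)
y = block(x)
print(y.shape)
```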

Check your understanding

You have a Conv2d layer with in_channels=3, out_channels=64, kernel_size=5, stride=1, padding=2, and input shape (batch=8, channels=3, height=28, width=28). What is the exact output shape, and how many learnable parameters does this layer have? Why does padding=2 matter here given the kernel_size=5?

Show answer hint

Output shape uses the formula with padding and stride. Calculate new_height = floor((28 - 5 + 2*2) / 1 + 1). Parameter count is (in_channels * kernel_h * kernel_w + 1) * out_channels. Padding=2 preserves input spatial dimensions (without it, height would shrink by 4), which is critical for residual connections and maintaining feature map size through the network.

VERSION Conv2d API is stable in PyTorch 2.11.x. No breaking changes from 2.6.x. The padding_mode parameter ('zeros', 'reflect', 'replicate', 'circular') is available and useful, but the core behavior is unchanged.
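As a quick illustration of padding_mode, the sketch below builds the same layer with each of the four modes. The mode changes the values used to fill the border, not the output shape; the input size here is arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 16)
shapes = {}
for mode in ["zeros", "reflect", "replicate", "circular"]:
    layer = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode=mode)
    shapes[mode] = tuple(layer(x).shape)  # same shape for every mode
    print(mode, shapes[mode])
```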

Next, you'll learn how to stack Conv2d layers with BatchNorm and activation functions into reusable blocks, then trace a complete forward pass through a small CNN to understand receptive field growth.
