High severity intermediate · Fix: 5-10 min

ValueError: shape mismatch

ValueError: Expected ControlNet conditioning image shape (batch, channels, height, width)

What this error means

The conditioning image passed to ControlNet has incorrect dimensions (batch, channels, height, or width), causing a shape mismatch during the forward pass through the control module.

Stack trace

traceback

ValueError: Expected ControlNet conditioning image of shape (batch_size, 3, height, width), but got shape (batch_size, 4, 512, 512). The number of input channels must be 3 (RGB) for standard ControlNet models.

File "/usr/local/lib/python3.11/site-packages/diffusers/models/controlnet.py", line 487, in forward
    raise ValueError(f'Expected ControlNet conditioning image of shape {expected_shape}, but got {conditioning_image.shape}')

diffusers.models.controlnet.ControlNetModel.forward(self, sample, timestep, encoder_hidden_states, controlnet_cond, conditioning_scale, guess_mode, return_dict)

QUICK FIX

Convert the conditioning image to RGB, resize it, and turn it into a float tensor of shape (1, 3, H, W) before the pipeline call: `conditioning_image = Image.open(image_path).convert('RGB'); conditioning_image = conditioning_image.resize((width, height), Image.LANCZOS); conditioning_image = torch.from_numpy(np.array(conditioning_image)).permute(2, 0, 1).unsqueeze(0).float() / 255.0`.

Why it happens

ControlNet models expect conditioning images in a specific format: (batch_size, 3, height, width) for RGB images. Common mismatches occur when: (1) the image has 4 channels (RGBA with alpha channel) instead of 3, (2) the spatial dimensions don't match the model's expected size, (3) the image is passed as (height, width, channels) instead of (channels, height, width), or (4) batch dimension is missing or incorrectly shaped. ControlNet is rigid about input shape because it performs pixel-level spatial control over the diffusion process.
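The four mismatches listed above can be told apart from the shape alone. The following is a minimal sketch, not part of diffusers: `diagnose_conditioning_shape` and its return labels are hypothetical names, and the heuristic assumes that an axis of size 1, 3, or 4 is a channel axis.

```python
from typing import Sequence


def diagnose_conditioning_shape(shape: Sequence[int], height: int, width: int) -> str:
    """Return a label for the first problem found in a conditioning-image shape.

    Hypothetical helper: it only looks at the shape, so it accepts
    tuple(tensor.shape) from torch or array.shape from NumPy.
    """
    dims = tuple(shape)
    channel_sizes = (1, 3, 4)  # assumption: grayscale, RGB, or RGBA

    if len(dims) == 3:
        # Three axes: either (H, W, C) or (C, H, W) with no batch axis.
        if dims[-1] in channel_sizes and dims[0] not in channel_sizes:
            return "channels-last"
        return "missing-batch"
    if len(dims) != 4:
        return "wrong-rank"

    _, channels, h, w = dims
    if channels not in channel_sizes and dims[-1] in channel_sizes:
        return "channels-last"   # (B, H, W, C) instead of (B, C, H, W)
    if channels != 3:
        return "wrong-channels"  # e.g. RGBA with 4 channels
    if (h, w) != (height, width):
        return "wrong-size"
    return "ok"


print(diagnose_conditioning_shape((1, 4, 512, 512), 512, 512))  # wrong-channels
print(diagnose_conditioning_shape((512, 512, 3), 512, 512))     # channels-last
```

The function reports only the first problem it finds, so a (4, 512, 512) tensor is labeled `missing-batch` even though it also has an alpha channel.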

Detection

Before passing conditioning images to ControlNet, assert the shape explicitly: `assert conditioning_image.shape == (batch_size, 3, height, width)` and log the actual shape on failure. Add an image preprocessing validation function that checks channels, dimensions, and dtype before calling the pipeline.
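A validation function of the kind described above might look like the following sketch. The name `validate_conditioning_tensor` is hypothetical, and the float-dtype check assumes the tensor is meant to hold values scaled to [0, 1], as in the preprocessing shown elsewhere on this page.

```python
import torch


def validate_conditioning_tensor(image: torch.Tensor, height: int, width: int) -> None:
    """Raise ValueError, reporting the actual shape or dtype, if the tensor is unusable."""
    actual = tuple(image.shape)
    if image.ndim != 4:
        raise ValueError(f"expected 4 dims (batch, 3, H, W), got shape {actual}")
    _, channels, h, w = actual
    if channels != 3:
        raise ValueError(f"expected 3 channels (RGB), got {channels} in shape {actual}")
    if (h, w) != (height, width):
        raise ValueError(f"expected spatial size {height}x{width}, got {h}x{w}")
    if not image.is_floating_point():
        # Assumption: conditioning tensors are floats, not raw uint8 pixels.
        raise ValueError(f"expected a floating-point tensor, got dtype {image.dtype}")


# Passes silently for a well-formed tensor.
validate_conditioning_tensor(torch.zeros(1, 3, 64, 64), height=64, width=64)
```

Raising `ValueError` rather than using a bare `assert` keeps the check active when Python runs with `-O`, which strips assertions.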

Causes & fixes

Conditioning image has 4 channels (RGBA with alpha) instead of 3 (RGB)

✓ Fix

Convert RGBA to RGB by dropping the alpha channel: `conditioning_image = conditioning_image.convert('RGB')` if using PIL, `conditioning_image = conditioning_image[:3, :, :]` for a torch tensor of shape (C, H, W), or `conditioning_image = conditioning_image[:, :3, :, :]` for a batched tensor of shape (B, C, H, W).
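Dropping the alpha channel keeps whatever color is stored under transparent pixels, which can leave unexpected regions in a control map. If that matters for your images, one option is to composite onto a solid background first. This is a sketch using Pillow; `flatten_alpha` is a hypothetical helper name, and the black default background is an assumption that suits edge or depth maps but may not suit other control types.

```python
from PIL import Image


def flatten_alpha(image: Image.Image, background=(0, 0, 0)) -> Image.Image:
    """Composite a transparent image onto a solid background and return RGB."""
    if image.mode != "RGBA":
        return image.convert("RGB")
    canvas = Image.new("RGB", image.size, background)
    # The alpha band acts as the paste mask: transparent pixels keep the background.
    canvas.paste(image, mask=image.getchannel("A"))
    return canvas


# A fully transparent pixel that still stores red underneath.
pixel = Image.new("RGBA", (1, 1), (255, 0, 0, 0))
print(pixel.convert("RGB").getpixel((0, 0)))  # (255, 0, 0): hidden red shows up
print(flatten_alpha(pixel).getpixel((0, 0)))  # (0, 0, 0): background is used
```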

Image dimensions (height/width) don't match the model's expected size or the `height`/`width` arguments passed to the pipeline

✓ Fix

Resize the conditioning image to the `height` and `width` you pass to the pipeline before calling it: `from torchvision.transforms import Resize; resize = Resize((height, width)); conditioning_image = resize(conditioning_image)`, or use PIL: `conditioning_image = conditioning_image.resize((width, height), Image.LANCZOS)`. Note the argument order: torchvision takes (height, width), while PIL takes (width, height).
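A swapped width and height is easy to miss with square images. The short check below uses a deliberately non-square target to make the ordering visible; the sizes are arbitrary illustration values.

```python
import numpy as np
from PIL import Image

height, width = 512, 768  # non-square on purpose, so a swap would be visible

image = Image.new("RGB", (640, 480))                    # PIL sizes are (width, height)
resized = image.resize((width, height), Image.LANCZOS)  # PIL resize takes (width, height)
array = np.array(resized)                               # NumPy shape is (height, width, channels)

print(resized.size)  # (768, 512)
print(array.shape)   # (512, 768, 3)
```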

Image tensor has shape (height, width, channels) instead of (batch, channels, height, width)

✓ Fix

Transpose and add a batch dimension: `conditioning_image = torch.from_numpy(np.array(pil_image)).permute(2, 0, 1).unsqueeze(0).float() / 255.0` to get a float tensor of shape (1, 3, H, W) with values in [0, 1]
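The conversion needs `permute`, not `reshape`. Both produce the target shape, but only `permute` keeps each pixel's channel values together. A tiny illustration on a made-up 2x3 "image":

```python
import torch

# A tiny channels-last image: H=2, W=3, C=3, filled with distinct values.
hwc = torch.arange(2 * 3 * 3).reshape(2, 3, 3)

chw = hwc.permute(2, 0, 1)    # moves the channel axis to the front
wrong = hwc.reshape(3, 2, 3)  # same shape as chw, but the values are scrambled

# With permute, the pixel at (y=1, x=2) keeps its three channel values.
print(torch.equal(chw[:, 1, 2], hwc[1, 2, :]))    # True
print(torch.equal(wrong[:, 1, 2], hwc[1, 2, :]))  # False

batched = chw.unsqueeze(0).float()
print(tuple(batched.shape))  # (1, 3, 2, 3)
```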

Batch dimension is missing or incorrectly sized (e.g., shape is (3, 512, 512) instead of (1, 3, 512, 512))

✓ Fix

Add the batch dimension with unsqueeze: `conditioning_image = conditioning_image.unsqueeze(0)` if using torch. For batch processing, stack per-image tensors of shape (3, H, W): `conditioning_images = torch.stack([preprocess_image(img) for img in image_list])`, where `preprocess_image` is your own helper that returns a tensor without a batch dimension.
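Whether to use `torch.stack` or `torch.cat` depends on whether the per-image tensors already carry a batch axis. The sketch below uses zero tensors as stand-ins for preprocessed images:

```python
import torch

per_image = [torch.zeros(3, 64, 64) for _ in range(2)]         # each (C, H, W)
already_batched = [torch.zeros(1, 3, 64, 64) for _ in range(2)]  # each (1, C, H, W)

# stack adds a new leading axis; cat joins along the existing batch axis.
print(tuple(torch.stack(per_image).shape))        # (2, 3, 64, 64)
print(tuple(torch.cat(already_batched).shape))    # (2, 3, 64, 64)

# Stacking tensors that already have a batch axis yields five dimensions.
print(tuple(torch.stack(already_batched).shape))  # (2, 1, 3, 64, 64)
```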

Code: broken vs fixed

Broken - triggers the error

python

import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from PIL import Image
import numpy as np

# Load ControlNet and pipeline (the ControlNet checkpoint must match the base model family, SDXL here)
controlnet = ControlNetModel.from_pretrained(
    'diffusers/controlnet-canny-sdxl-1.0',
    torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')

# Load image with alpha channel (RGBA)
control_image = Image.open('control.png')  # RGBA image, 4 channels
print(f'Image mode: {control_image.mode}, size: {control_image.size}')

# BROKEN: Pass RGBA image directly without converting to RGB
control_image_tensor = torch.from_numpy(np.array(control_image)).permute(2, 0, 1).unsqueeze(0).float() / 255.0
print(f'Tensor shape: {control_image_tensor.shape}')  # Shape: (1, 4, 512, 512) — WRONG!

# This line raises ValueError: shape mismatch
output = pipe(
    prompt='a beautiful landscape',
    image=control_image_tensor,
    height=768,
    width=768,
    num_inference_steps=50
).images[0]

Fixed - works correctly

python

import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from PIL import Image
import numpy as np

# Load ControlNet and pipeline (the ControlNet checkpoint must match the base model family, SDXL here)
controlnet = ControlNetModel.from_pretrained(
    'diffusers/controlnet-canny-sdxl-1.0',
    torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')

# Load image and convert RGBA to RGB
control_image = Image.open('control.png')  # RGBA image, 4 channels
control_image = control_image.convert('RGB')  # FIX: Convert to RGB (3 channels)
print(f'Image mode: {control_image.mode}, size: {control_image.size}')  # Now: RGB, (512, 512)

# Resize to match expected dimensions
height, width = 768, 768
control_image = control_image.resize((width, height), Image.LANCZOS)  # FIX: Resize to match pipeline height/width

# Preprocess to tensor with correct shape (1, 3, 768, 768)
control_image_tensor = torch.from_numpy(np.array(control_image)).permute(2, 0, 1).unsqueeze(0).float() / 255.0
print(f'Tensor shape: {control_image_tensor.shape}')  # Shape: (1, 3, 768, 768) — CORRECT!

# Now this works
output = pipe(
    prompt='a beautiful landscape',
    image=control_image_tensor,
    height=height,  # same values used to resize the conditioning image
    width=width,
    num_inference_steps=50
).images[0]

print('Success! Generated image with ControlNet conditioning.')

Added .convert('RGB') to drop the alpha channel and match ControlNet's expected 3-channel input, then resized the image to match pipeline dimensions before converting to tensor. This ensures the final tensor shape is (1, 3, 768, 768) instead of (1, 4, 512, 512).

⚠

Workaround

If you must accept RGBA images and can't modify the preprocessing code, wrap the pipeline call so that a shape-related error triggers one retry with an RGB copy: catch `ValueError`, check `'shape' in str(e)`, convert with `conditioning_image = conditioning_image.convert('RGB')` (for a uint8 NumPy array, slice `array[:, :, :3]` without rescaling), then call `pipe(image=conditioning_image, ...)` again. This is fragile because it depends on the wording of the error message; prefer fixing preprocessing instead.
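The retry logic could be sketched as below. `run_with_rgb_fallback` and the `run` callable are hypothetical names introduced for illustration; `run` is assumed to wrap your own pipeline call, and `image` is assumed to be a PIL image (anything with a `.convert` method).

```python
from typing import Any, Callable


def run_with_rgb_fallback(run: Callable[[Any], Any], image: Any) -> Any:
    """Call run(image); on a shape-related ValueError, retry once with an RGB copy.

    `run` is a hypothetical wrapper around the pipeline call, for example
    lambda img: pipe(prompt=my_prompt, image=img).images[0].
    """
    try:
        return run(image)
    except ValueError as error:
        if "shape" not in str(error).lower():
            raise  # unrelated error: do not mask it
        # Retry once; a second failure propagates to the caller.
        return run(image.convert("RGB"))
```

Matching on the message text is the weak point: if the wording changes between library versions, the fallback silently stops applying.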

✓

Prevention

Create a reusable preprocessing function that handles all shape conversions centrally: `def prepare_controlnet_image(image_path, height, width): img = Image.open(image_path).convert('RGB').resize((width, height), Image.LANCZOS); tensor = torch.from_numpy(np.array(img)).permute(2, 0, 1).unsqueeze(0).float() / 255.0; assert tensor.shape == (1, 3, height, width), f'Shape mismatch: {tensor.shape}'; return tensor`. Call this function for all conditioning images before passing to the pipeline, and add unit tests that verify shape output for RGB, RGBA, and grayscale inputs.
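Written out in full, the function described above could look like this sketch. It is a variant under one stated assumption: it takes an already opened PIL image instead of a path, so that it can be exercised with in-memory images in unit tests.

```python
import numpy as np
import torch
from PIL import Image


def prepare_controlnet_image(image: Image.Image, height: int, width: int) -> torch.Tensor:
    """Return a float32 tensor of shape (1, 3, height, width) with values in [0, 1]."""
    rgb = image.convert("RGB")                        # RGBA or grayscale -> 3 channels
    rgb = rgb.resize((width, height), Image.LANCZOS)  # PIL takes (width, height)
    array = np.array(rgb, dtype=np.float32) / 255.0   # (H, W, 3), scaled to [0, 1]
    tensor = torch.from_numpy(array).permute(2, 0, 1).unsqueeze(0)
    if tensor.shape != (1, 3, height, width):
        raise ValueError(f"Shape mismatch: {tuple(tensor.shape)}")
    return tensor


# The same call handles RGB, RGBA, and grayscale inputs.
for mode in ("RGB", "RGBA", "L"):
    result = prepare_controlnet_image(Image.new(mode, (100, 80)), height=64, width=96)
    print(mode, tuple(result.shape))
```

For files on disk, call it as `prepare_controlnet_image(Image.open(image_path), height, width)`.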

Python 3.9+ · diffusers >=0.27.0 · tested on 0.28.x

Verified 2026-04 · thibaud/controlnet-sd21, stabilityai/stable-diffusion-xl-base-1.0
