ValueError: shape mismatch
ValueError: Expected ControlNet conditioning image shape (batch, channels, height, width)
Stack trace
ValueError: Expected ControlNet conditioning image of shape (batch_size, 3, height, width), but got shape (batch_size, 4, 512, 512). The number of input channels must be 3 (RGB) for standard ControlNet models.
File "/usr/local/lib/python3.11/site-packages/diffusers/models/controlnet.py", line 487, in forward
raise ValueError(f'Expected ControlNet conditioning image of shape {expected_shape}, but got {conditioning_image.shape}')
diffusers.models.controlnet.ControlNetModel.forward(self, sample, timestep, encoder_hidden_states, controlnet_cond, conditioning_scale, guess_mode, return_dict)
Why it happens
ControlNet models expect conditioning images in a specific format: (batch_size, 3, height, width) for RGB images. Common mismatches occur when (1) the image has 4 channels (RGBA, with an alpha channel) instead of 3, (2) the spatial dimensions don't match the model's expected size, (3) the image is passed as (height, width, channels) instead of (channels, height, width), or (4) the batch dimension is missing or incorrectly shaped. ControlNet is strict about input shape because it applies pixel-level spatial control to the diffusion process, so the conditioning image has to line up with the image being generated.
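The four mismatches above can usually be told apart from the shape alone. A minimal sketch, assuming the shape is available as a plain tuple (for example `tuple(tensor.shape)`); `diagnose_conditioning_shape` is a hypothetical helper, not part of diffusers, and its channel-count heuristic is only illustrative:

```python
def diagnose_conditioning_shape(shape, height, width):
    """Return hints about why `shape` is not (batch, 3, height, width).

    Hypothetical helper for illustration; it only inspects the shape tuple.
    """
    shape = tuple(shape)
    hints = []

    if len(shape) == 3:
        hints.append("missing batch dimension")
        shape = (1,) + shape  # pretend the batch axis exists for the checks below
    elif len(shape) != 4:
        return [f"expected 4 dimensions, got {len(shape)}"]

    _, d1, d2, d3 = shape
    # Heuristic: a channels-last layout has a small channel count in the final axis.
    if d3 in (1, 3, 4) and d1 not in (1, 3, 4):
        hints.append("channels-last layout (H, W, C); expected (C, H, W)")
        channels, h, w = d3, d1, d2
    else:
        channels, h, w = d1, d2, d3

    if channels == 4:
        hints.append("4 channels (RGBA); expected 3 (RGB)")
    elif channels != 3:
        hints.append(f"{channels} channels; expected 3 (RGB)")

    if (h, w) != (height, width):
        hints.append(f"spatial size {h}x{w}; expected {height}x{width}")

    return hints


print(diagnose_conditioning_shape((1, 4, 512, 512), 512, 512))
print(diagnose_conditioning_shape((512, 512, 3), 512, 512))
```

An empty list means the shape already matches; each hint maps to one of the causes listed above.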
Detection
Before passing conditioning images to ControlNet, assert the shape explicitly and include the actual shape in the failure message: `assert conditioning_image.shape == (batch_size, 3, height, width), conditioning_image.shape`. Add an image preprocessing validation function that checks channels, dimensions, and dtype before calling the pipeline.
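One possible form of such a validation function is sketched below. It assumes only that the image object exposes `.shape` and `.dtype`, as torch tensors and NumPy arrays both do; `validate_conditioning` is a hypothetical name, not a diffusers API, and the dtype check via `str()` is a simplification.

```python
import logging

logger = logging.getLogger(__name__)


def validate_conditioning(image, batch_size, height, width):
    """Raise ValueError unless `image` is a float array shaped (batch_size, 3, height, width).

    Hypothetical helper; accepts any object with `.shape` and `.dtype`.
    """
    expected = (batch_size, 3, height, width)
    actual = tuple(image.shape)
    if actual != expected:
        logger.error("Conditioning image shape %s, expected %s", actual, expected)
        raise ValueError(f"Expected conditioning shape {expected}, got {actual}")

    # str() yields e.g. 'torch.float16' or 'float32'. Integer images are usually
    # unscaled 0-255 data that has not been divided by 255 yet.
    if "float" not in str(image.dtype):
        logger.error("Conditioning image dtype %s is not floating point", image.dtype)
        raise ValueError(f"Expected a floating-point dtype, got {image.dtype}")

    return image
```

Raising `ValueError` rather than using a bare `assert` keeps the check active even when Python runs with assertions disabled (`-O`).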
Causes & fixes
Conditioning image has 4 channels (RGBA with alpha) instead of 3 (RGB)
Convert RGBA to RGB by dropping the alpha channel: `conditioning_image = conditioning_image.convert('RGB')` if using PIL, or `conditioning_image = conditioning_image[:3, :, :]` if using torch tensors with shape (C, H, W)
Image dimensions (height/width) don't match the model's expected size or the `height`/`width` arguments passed to the pipeline
Resize the conditioning image to match your configuration before calling the pipeline: `from torchvision.transforms import Resize; resize = Resize((height, width)); conditioning_image = resize(conditioning_image)` or use PIL: `conditioning_image = conditioning_image.resize((width, height), Image.LANCZOS)`. Note the argument order: torchvision takes (height, width), while PIL takes (width, height).
Image tensor has shape (height, width, channels) instead of (batch, channels, height, width)
Transpose and add batch dimension: `conditioning_image = torch.from_numpy(np.array(pil_image)).permute(2, 0, 1).unsqueeze(0).float()` to get shape (1, 3, H, W)
Batch dimension is missing or incorrectly sized (e.g., shape is (3, 512, 512) instead of (1, 3, 512, 512))
Add batch dimension with unsqueeze: `conditioning_image = conditioning_image.unsqueeze(0)` if using torch, or convert list of images: `conditioning_images = torch.stack([preprocess_image(img) for img in image_list])` for batch processing
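The channel, layout, and batch fixes above can be combined into a single normalization step. A minimal sketch using NumPy arrays (the same operations exist for torch tensors as `permute`, `unsqueeze`, and slicing); `to_nchw_rgb` and its `channels_last` flag are hypothetical names chosen for illustration. Resizing is left out because it is simpler to do on the PIL image before conversion.

```python
import numpy as np


def to_nchw_rgb(array, channels_last):
    """Normalize an image array to shape (batch, 3, height, width).

    `channels_last` states whether the input is laid out as (..., H, W, C).
    Hypothetical helper for illustration.
    """
    array = np.asarray(array)

    # Missing batch dimension: (C, H, W) or (H, W, C) becomes 4-D.
    if array.ndim == 3:
        array = array[np.newaxis, ...]
    if array.ndim != 4:
        raise ValueError(f"Expected 3 or 4 dimensions, got {array.ndim}")

    # Channels-last input: move the channel axis to position 1.
    if channels_last:
        array = array.transpose(0, 3, 1, 2)

    # RGBA input: keep the first three channels and drop alpha.
    if array.shape[1] == 4:
        array = array[:, :3]
    if array.shape[1] != 3:
        raise ValueError(f"Expected 3 or 4 channels, got {array.shape[1]}")

    return array


rgba_hwc = np.zeros((512, 512, 4), dtype=np.float32)
print(to_nchw_rgb(rgba_hwc, channels_last=True).shape)  # (1, 3, 512, 512)
```

The layout is passed explicitly instead of being guessed, since a shape such as (3, 512, 3) cannot be classified reliably from its dimensions.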
Code: broken vs fixed
import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from PIL import Image
import numpy as np
# Load an SDXL ControlNet and the pipeline (the ControlNet checkpoint must match the SDXL base model)
controlnet = ControlNetModel.from_pretrained(
    'diffusers/controlnet-canny-sdxl-1.0',
    torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')
# Load image with alpha channel (RGBA)
control_image = Image.open('control.png') # RGBA image, 4 channels
print(f'Image mode: {control_image.mode}, size: {control_image.size}')
# BROKEN: Pass RGBA image directly without converting to RGB
control_image_tensor = torch.from_numpy(np.array(control_image)).permute(2, 0, 1).unsqueeze(0).float() / 255.0
print(f'Tensor shape: {control_image_tensor.shape}') # Shape: (1, 4, 512, 512) — WRONG!
# This line raises ValueError: shape mismatch
output = pipe(
    prompt='a beautiful landscape',
    image=control_image_tensor,
    height=768,
    width=768,
    num_inference_steps=50
).images[0]

import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from PIL import Image
import numpy as np
# Load an SDXL ControlNet and the pipeline (the ControlNet checkpoint must match the SDXL base model)
controlnet = ControlNetModel.from_pretrained(
    'diffusers/controlnet-canny-sdxl-1.0',
    torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')
# Load image and convert RGBA to RGB
control_image = Image.open('control.png') # RGBA image, 4 channels
control_image = control_image.convert('RGB') # FIX: Convert to RGB (3 channels)
print(f'Image mode: {control_image.mode}, size: {control_image.size}') # Now: RGB, (512, 512)
# Resize to match expected dimensions
height, width = 768, 768
control_image = control_image.resize((width, height), Image.LANCZOS) # FIX: Resize to match pipeline height/width
# Preprocess to tensor with correct shape (1, 3, 768, 768)
control_image_tensor = torch.from_numpy(np.array(control_image)).permute(2, 0, 1).unsqueeze(0).float() / 255.0
print(f'Tensor shape: {control_image_tensor.shape}') # Shape: (1, 3, 768, 768) — CORRECT!
# Now this works
output = pipe(
    prompt='a beautiful landscape',
    image=control_image_tensor,
    height=768,
    width=768,
    num_inference_steps=50
).images[0]
print('Success! Generated image with ControlNet conditioning.')
Workaround
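If you must work with RGBA images and can't modify preprocessing, write a custom wrapper that catches the shape mismatch error and auto-converts: call `pipe(image=conditioning_image, ...)` inside a `try` block, catch `ValueError`, and if `'shape' in str(e)`, drop the alpha channel and call `pipe` again. For a PIL image the conversion is `conditioning_image = conditioning_image.convert('RGB')`; for an (H, W, 4) uint8 array it is `Image.fromarray(np.array(conditioning_image)[:, :, :3])`. Multiply by 255 and cast to `np.uint8` only if the array holds floats in [0, 1]; applying that scaling to uint8 data overflows. This workaround is fragile because it depends on the wording of the error message; prefer fixing preprocessing instead.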
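The retry logic described above might look like the following. This is a sketch under assumptions: `call_with_rgb_fallback` is a hypothetical wrapper, `pipe` is any callable that accepts an `image=` keyword (such as the pipeline object), and the image is assumed to offer a `convert('RGB')` method as PIL images do.

```python
def call_with_rgb_fallback(pipe, image, **kwargs):
    """Call `pipe`, retrying once with an RGB copy if a shape error is raised.

    Hypothetical wrapper for illustration; `image` must provide `.convert('RGB')`.
    """
    try:
        return pipe(image=image, **kwargs)
    except ValueError as exc:
        # Handle only the shape mismatch; re-raise every other ValueError unchanged.
        if "shape" not in str(exc):
            raise
        rgb_image = image.convert("RGB")  # drops the alpha channel
        return pipe(image=rgb_image, **kwargs)
```

A call such as `call_with_rgb_fallback(pipe, control_image, prompt='a beautiful landscape')` would then replace the direct `pipe(...)` call. The substring match on the message is the fragile part, and a failed first attempt still costs whatever work the pipeline did before raising.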
Prevention
Create a reusable preprocessing function that handles all shape conversions centrally: `def prepare_controlnet_image(image_path, height, width): img = Image.open(image_path).convert('RGB').resize((width, height), Image.LANCZOS); tensor = torch.from_numpy(np.array(img)).permute(2, 0, 1).unsqueeze(0).float() / 255.0; assert tensor.shape == (1, 3, height, width), f'Shape mismatch: {tensor.shape}'; return tensor`. Call this function for all conditioning images before passing to the pipeline, and add unit tests that verify shape output for RGB, RGBA, and grayscale inputs.
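A sketch of the unit tests suggested above, written against in-memory PIL images so no files are needed. To keep the example free of a torch dependency, `prepare_controlnet_array` is a hypothetical NumPy variant of the function above that takes an already opened image; passing its result to `torch.from_numpy` would give the tensor form.

```python
import numpy as np
from PIL import Image


def prepare_controlnet_array(img, height, width):
    """Convert a PIL image to a float32 array of shape (1, 3, height, width) in [0, 1]."""
    img = img.convert("RGB").resize((width, height), Image.LANCZOS)
    array = np.asarray(img, dtype=np.float32) / 255.0  # (H, W, 3)
    array = array.transpose(2, 0, 1)[np.newaxis, ...]  # (1, 3, H, W)
    assert array.shape == (1, 3, height, width), f"Shape mismatch: {array.shape}"
    return array


def test_shape_for_common_modes():
    # "RGB", "RGBA", and "L" (grayscale) inputs should all yield the same shape.
    for mode in ("RGB", "RGBA", "L"):
        source = Image.new(mode, (640, 480))  # PIL sizes are (width, height)
        result = prepare_controlnet_array(source, height=768, width=512)
        assert result.shape == (1, 3, 768, 512)
        assert result.dtype == np.float32


test_shape_for_common_modes()
```

Using a non-square target size in the test helps catch a swapped width and height, which a square size such as 768 by 768 would hide.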