How-to · Beginner to intermediate · 3 min read

Fix slow image generation in Stable Diffusion

Quick answer
Fix slow generation in Stable Diffusion by enabling GPU acceleration with CUDA or ROCm, loading the model in half precision (torch_dtype=torch.float16), and reducing num_inference_steps. The diffusers library makes this straightforward, and quantized or distilled model variants can speed things up further.

PREREQUISITES

  • Python 3.8+
  • pip install torch torchvision diffusers transformers accelerate
  • NVIDIA GPU with CUDA or AMD GPU with ROCm (optional but recommended)

Setup environment

Install the necessary Python packages and ensure your GPU drivers and CUDA toolkit are properly installed for hardware acceleration.

bash
pip install torch torchvision diffusers transformers accelerate
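Before loading any model, it is worth confirming that PyTorch can actually see your GPU. A minimal sketch (the helper name gpu_available is just for illustration) that works whether or not torch is installed:

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if torch is installed and reports a usable CUDA device."""
    # Check for torch without raising ImportError on a fresh environment
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print("GPU acceleration available:", gpu_available())
```

If this prints False on a machine with an NVIDIA GPU, you likely installed a CPU-only torch wheel; reinstall a CUDA-enabled build before continuing.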

Step-by-step speedup

Use the diffusers library with GPU and mixed precision to speed up generation. Reduce num_inference_steps and use a smaller or optimized model.

python
import torch
from diffusers import StableDiffusionPipeline

# Load model with GPU and half precision
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Slice attention to reduce VRAM usage (can cost a little speed);
# skip this if you have plenty of memory
pipe.enable_attention_slicing()

prompt = "A futuristic cityscape at sunset"

# Generate image with fewer steps for speed
image = pipe(prompt, num_inference_steps=20).images[0]

image.save("output.png")
print("Image generated and saved as output.png")
output
Image generated and saved as output.png
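The scheduler (sampler) also affects how many steps you need. As a variation on the script above, this sketch swaps in diffusers' DPMSolverMultistepScheduler, which typically converges in roughly 20 steps; the function name and output filename are just examples:

```python
def build_fast_pipeline(model_id: str = "runwayml/stable-diffusion-v1-5"):
    """Load the pipeline in fp16 on the GPU and swap in a DPM-Solver scheduler."""
    import torch
    from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    # Reuse the existing scheduler config so model-specific settings carry over
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe


# Usage (requires a CUDA GPU and a model download):
# pipe = build_fast_pipeline()
# image = pipe("A futuristic cityscape at sunset", num_inference_steps=20).images[0]
# image.save("output_dpm.png")
```

Building the new scheduler from the old scheduler's config is the idiomatic diffusers pattern: it keeps the pipeline's noise schedule settings intact while changing only the solver.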

Common variations

  • Use distilled variants such as SD Turbo or SDXL Turbo, which generate usable images in 1–4 steps, for a better speed-quality tradeoff.
  • Try quantized models (4-bit or 8-bit) with bitsandbytes for lower VRAM and faster inference.
  • Use accelerate to optimize device placement and mixed precision automatically.
  • Run inference asynchronously or batch multiple prompts for throughput.
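For the batching variation, diffusers pipelines accept a list of prompts in a single call, which amortizes per-call overhead. A minimal sketch (chunk_prompts and generate_batched are hypothetical helper names; pick a batch_size that fits your VRAM):

```python
def chunk_prompts(prompts: list[str], batch_size: int) -> list[list[str]]:
    """Split prompts into batches small enough to fit in GPU memory."""
    return [prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)]


def generate_batched(pipe, prompts: list[str], batch_size: int = 4, steps: int = 20):
    """Run the pipeline once per batch; each call returns a list of images."""
    images = []
    for batch in chunk_prompts(prompts, batch_size):
        images.extend(pipe(batch, num_inference_steps=steps).images)
    return images


# Usage (with a loaded pipeline):
# images = generate_batched(pipe, ["a red car", "a blue car", "a green car"])
```

Larger batches improve throughput up to the point where VRAM runs out; if you hit out-of-memory errors, lower batch_size rather than disabling batching entirely.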

Troubleshooting tips

  • If generation is still slow, verify GPU usage with nvidia-smi or system monitors.
  • Check that torch is installed with CUDA support: torch.cuda.is_available() should return True.
  • Make sure the pipeline is actually on the GPU: call pipe.to("cuda") explicitly rather than relying on defaults. A model that silently falls back to CPU will be an order of magnitude slower.
  • Reduce image resolution or batch size to improve speed.

Key Takeaways

  • Always enable GPU acceleration with CUDA or ROCm for Stable Diffusion inference.
  • Use optimized models and reduce num_inference_steps to speed up generation.
  • Leverage mixed precision (float16) and attention slicing to lower memory and increase speed.
Verified 2026-04 · runwayml/stable-diffusion-v1-5, Stable Diffusion XL