How to · Intermediate · 3 min read

Fix Stable Diffusion CUDA out of memory

Quick answer
Fix CUDA out-of-memory errors in Stable Diffusion by reducing batch_size, enabling mixed precision with torch.autocast, and using memory-efficient attention such as xformers. Also clear the GPU cache with torch.cuda.empty_cache(), and consider lowering the height and width parameters to reduce VRAM usage.

PREREQUISITES

  • Python 3.8+
  • PyTorch with CUDA support
  • Stable Diffusion installed (e.g., via diffusers or AUTOMATIC1111)
  • NVIDIA GPU with CUDA drivers installed

Setup

Ensure you have torch installed with CUDA support and the latest diffusers or Stable Diffusion repository. Install xformers for memory-efficient attention to reduce VRAM usage.

bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install diffusers transformers accelerate
pip install xformers
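
Before loading any model, it is worth confirming that PyTorch actually sees your GPU. A minimal sketch (the `cuda_report` helper is an illustrative name, not part of any library):

```python
import torch

def cuda_report():
    # Report the GPU PyTorch will use, or warn that inference
    # will fall back to the (much slower) CPU
    if not torch.cuda.is_available():
        return "CUDA not available - Stable Diffusion will run on CPU"
    props = torch.cuda.get_device_properties(0)
    return f"{torch.cuda.get_device_name(0)}: {props.total_memory / 1024**3:.1f} GiB VRAM"

print(cuda_report())
```

If this reports no CUDA, fix your driver and toolkit install before tuning any memory settings.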

Step by step

Use the following Python script to run Stable Diffusion with reduced VRAM usage by enabling mixed precision and memory-efficient attention, and lowering batch size and image dimensions.

python
import torch
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Enable memory efficient attention if available
try:
    pipe.enable_xformers_memory_efficient_attention()
except Exception as e:
    print(f"xformers not available: {e}")

prompt = "A fantasy landscape, trending on artstation"

# Reduce batch size and image size to fit GPU memory
batch_size = 1
height = 512
width = 512

with torch.autocast("cuda"):
    for i in range(batch_size):
        image = pipe(prompt, height=height, width=width).images[0]
        image.save(f"output_{i}.png")  # unique filename per image

# Clear cache after generation
torch.cuda.empty_cache()
output
Saving output_0.png with generated image

Common variations

  • Use pipe.enable_attention_slicing() to further reduce VRAM by slicing attention computation.
  • Lower height and width parameters (e.g., 384x384) for less memory usage.
  • Run inference on CPU if GPU memory is insufficient (much slower).
  • Always use batch_size=1 on GPUs with limited VRAM.
python
pipe.enable_attention_slicing()
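
To see why lowering the resolution helps so much: Stable Diffusion's VAE downsamples images 8x, so the UNet's activation memory grows roughly with the number of latent pixels. A back-of-the-envelope helper (`relative_memory` is a hypothetical name, and linear scaling is an approximation):

```python
def relative_memory(height, width, base=(512, 512)):
    # Approximate activation footprint relative to a 512x512 render:
    # the VAE downsamples 8x, so the UNet sees (height/8) x (width/8) latent pixels
    latents = (height // 8) * (width // 8)
    base_latents = (base[0] // 8) * (base[1] // 8)
    return latents / base_latents

print(relative_memory(384, 384))  # 0.5625 -> roughly half the 512x512 footprint
```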

Troubleshooting

  • If you get CUDA out of memory errors, reduce batch_size or image resolution.
  • Restart your Python kernel or script to clear GPU memory.
  • Run torch.cuda.empty_cache() before and after generation to free unused memory.
  • Check your GPU VRAM usage with nvidia-smi to monitor memory consumption.
  • Update your GPU drivers and CUDA toolkit to latest versions for better memory management.
python
import torch
torch.cuda.empty_cache()
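
A common pattern is to catch the OOM error, clear the cache, and retry at a smaller resolution. A minimal sketch, assuming a `generate` callable that wraps your pipeline call (e.g. `lambda h, w: pipe(prompt, height=h, width=w).images[0]`); `generate_with_fallback` and the size ladder are illustrative choices:

```python
import torch

def generate_with_fallback(generate, sizes=((768, 768), (512, 512), (384, 384))):
    # Try progressively smaller resolutions until one fits in VRAM
    for h, w in sizes:
        try:
            return generate(h, w)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
    raise RuntimeError("Out of memory even at the smallest resolution")
```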

Key Takeaways

  • Reduce batch size and image resolution to fit your GPU VRAM limits.
  • Enable mixed precision with torch.autocast and memory-efficient attention via xformers.
  • Use pipe.enable_attention_slicing() to lower peak memory usage during inference.
  • Clear GPU cache with torch.cuda.empty_cache() to avoid fragmentation.
  • Monitor GPU memory with nvidia-smi and update drivers regularly.
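
On fragmentation specifically, PyTorch's caching allocator can be tuned through the PYTORCH_CUDA_ALLOC_CONF environment variable; capping max_split_size_mb often helps when nvidia-smi shows free memory yet allocations still fail. It must be set before torch initializes CUDA, and the 128 MB value below is a common starting point, not a universal recommendation:

```python
import os

# Set before importing torch (or at least before any CUDA call):
# caps how large a cached block the allocator may split, reducing fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # the allocator reads the variable on first CUDA use
```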
Verified 2026-04 · runwayml/stable-diffusion-v1-5