High severity · Intermediate · Fix: 5-10 min

RuntimeError / OutOfMemoryError

torch.cuda.OutOfMemoryError or RuntimeError during enable_model_cpu_offload()

What this error means

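The GPU ran out of memory even though enable_model_cpu_offload() was called. Either offloading was not applied correctly to the pipeline components, cached GPU memory was not released between runs, or a single component and its activations still need more VRAM than the card has free.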

Stack trace

traceback

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB on cuda:0 (12.00 GiB total) but only 1.50 GiB left.

During handling of the above exception, another exception occurred:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB on cuda:0. GPU 0 has a total capacty of 12.00 GiB of which 1.50 GiB is free. Of the allocated memory 8.50 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved memory is causing the problem, try setting max_split_size_mb to avoid fragmentation.
  File "diffusers/pipelines/stable_diffusion_xl.py", line 512, in __call__
    latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
RuntimeError: CUDA out of memory. Tried to allocate...

QUICK FIX

Load the pipeline without calling pipe.to('cuda'), call pipe.enable_attention_slicing() and pipe.enable_model_cpu_offload() immediately after loading, generate one image per call, and call torch.cuda.empty_cache() between runs.

Why it happens

enable_model_cpu_offload() keeps the pipeline components (text encoders, UNet, VAE) in CPU RAM and moves each one to the GPU only when it is about to run; it stays there until another component needs the GPU. That lowers VRAM use only if: (1) you call it on the pipeline object after from_pretrained() and do not also move the whole pipeline to the GPU with pipe.to('cuda'), (2) every component supports offloading, and (3) PyTorch's caching allocator is not badly fragmented. Even with offloading active, the largest component and its activations must still fit on the GPU on their own, so SDXL and newer models with larger UNets can still overflow cards with 8 GB or less, especially at 1024x1024 or with more than one image per call. enable_attention_slicing() lowers that peak further.

Detection

Before running inference, log torch.cuda.memory_allocated() before and after calling enable_model_cpu_offload(). After the call it should be below roughly 500 MB; if it stays high, the components are still on the GPU and offloading did not take effect. Wrap the first inference call in try/except for torch.cuda.OutOfMemoryError so the failure surfaces early, and read the traceback to see which component was running when memory ran out.
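The checks described above could be sketched roughly as follows. The helper names (allocated_gb, offload_is_active, first_call_with_oom_report) are hypothetical, the 0.5 GB threshold is the rule of thumb from the text, and the torch import is guarded only so the helpers can also be imported on a machine without PyTorch.

```python
try:
    import torch  # optional here so the helpers also import on CPU-only machines
except ImportError:
    torch = None


def allocated_gb() -> float:
    """GB currently held by tensors on the current CUDA device (0.0 without a GPU)."""
    if torch is None or not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 1e9


def offload_is_active(allocated_after_gb: float, threshold_gb: float = 0.5) -> bool:
    """Rule of thumb from the text: little should stay on the GPU once offloading is on."""
    return allocated_after_gb < threshold_gb


def first_call_with_oom_report(generate, *args, **kwargs):
    """Run the first inference call; report allocated memory if CUDA runs out."""
    try:
        return generate(*args, **kwargs)
    except RuntimeError as exc:  # torch.cuda.OutOfMemoryError subclasses RuntimeError
        if "out of memory" not in str(exc).lower():
            raise  # unrelated error: let it propagate unchanged
        print(f"OOM with {allocated_gb():.2f} GB still allocated: {exc}")
        return None
```

A possible use: record allocated_gb() before pipe.enable_model_cpu_offload(), pass the value measured afterwards to offload_is_active(), then run first_call_with_oom_report(pipe, prompt) for the first image.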

Causes & fixes

enable_model_cpu_offload() called on pipeline but components weren't actually moved to CPU

✓ Fix

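Call pipe.enable_model_cpu_offload() right after from_pretrained(), in place of pipe.to('cuda') rather than in addition to it. Then verify it worked with print(pipe.unet.device): it should show 'cpu' while the UNet is not in use. If it still shows 'cuda', offloading didn't activate: check that the diffusers version is >=0.21.0 and that the accelerate package, which the offload hooks depend on, is installed.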
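To check every component rather than only the UNet, a small helper along these lines may be useful. It is a sketch that assumes the pipeline exposes its parts as a dict (diffusers pipelines provide this as pipe.components); the function names component_devices and still_on_gpu are hypothetical.

```python
def component_devices(components: dict) -> dict:
    """Map each component name to the device of its first parameter."""
    devices = {}
    for name, module in components.items():
        params = getattr(module, "parameters", None)
        if not callable(params):
            continue  # tokenizers, schedulers and None entries carry no weights
        first = next(iter(params()), None)
        if first is not None:
            devices[name] = str(first.device)
    return devices


def still_on_gpu(devices: dict) -> list:
    """Names of components whose weights are currently on a CUDA device."""
    return [name for name, device in devices.items() if device.startswith("cuda")]
```

Called as still_on_gpu(component_devices(pipe.components)) while the pipeline is idle, an empty list would suggest that offloading is in effect.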

SDXL and other large models exceed VRAM even with CPU offloading, because the UNet and its activations alone are too large for the GPU

✓ Fix

Combine enable_model_cpu_offload() with enable_attention_slicing() to reduce peak memory: call pipe.enable_attention_slicing() and then pipe.enable_model_cpu_offload(). For SDXL on 6 GB GPUs, also call pipe.unet.enable_forward_chunking(chunk_size=1, dim=1), which runs the UNet's feed-forward layers over smaller chunks of the input instead of all at once.
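The combination above can be wrapped in one setup function so that it is applied the same way everywhere. This is a minimal sketch: apply_memory_options and its vram_gb parameter are hypothetical names, and the 6 GB cutoff simply mirrors the figure quoted in the text.

```python
def apply_memory_options(pipe, vram_gb: float) -> list:
    """Enable the options from the fix above; return the names that were applied."""
    applied = []

    pipe.enable_attention_slicing()
    applied.append("attention_slicing")

    pipe.enable_model_cpu_offload()
    applied.append("model_cpu_offload")

    if vram_gb <= 6:
        # Feed-forward chunking on the UNet, with the values quoted in the text.
        pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
        applied.append("forward_chunking")

    return applied
```

Returning the list of applied options makes it easy to log the configuration next to any later memory measurements.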

Generating more than one image per call, so the UNet's activations demand >12GB VRAM even after offloading

✓ Fix

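Generate one image per call: pass a single prompt and leave num_images_per_prompt at 1. If you must batch, call pipe.enable_attention_slicing() BEFORE inference to lower peak attention memory, and consider pipe.enable_vae_slicing() so the VAE decodes the batch one image at a time. pipe.unet.enable_gradient_checkpointing() does not help here: gradient checkpointing saves memory during the backward pass in training and has no effect on inference, which runs without gradients.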
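One way to keep the effective batch size at 1 while still producing several images is to loop over the prompts. The sketch below assumes a diffusers-style pipeline whose call returns an object with an .images list; generate_one_at_a_time and the after_each callback are hypothetical names.

```python
def generate_one_at_a_time(pipe, prompts, after_each=None, **kwargs):
    """Run the pipeline once per prompt so only one image is in flight at a time."""
    images = []
    for prompt in prompts:
        result = pipe(prompt, num_images_per_prompt=1, **kwargs)
        images.extend(result.images)
        if after_each is not None:
            after_each()  # e.g. torch.cuda.empty_cache
    return images
```

For example, generate_one_at_a_time(pipe, prompts, after_each=torch.cuda.empty_cache, height=1024, width=1024) trades some throughput for a lower memory peak.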

GPU memory is fragmented from previous inference runs, so the allocator can't find a large enough contiguous block

✓ Fix

Call torch.cuda.empty_cache() after each image generation to release cached, unused blocks. If memory still fragments, restart the Python process entirely.
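A sketch of the cleanup step, together with the max_split_size_mb hint that the error message itself mentions. PyTorch reads that setting from the PYTORCH_CUDA_ALLOC_CONF environment variable, which has to be set before the first CUDA allocation. The value 128 is purely illustrative, and free_cached_memory is a hypothetical helper name.

```python
import gc
import os

# Must be set before the first CUDA allocation. 128 is an illustrative value,
# not a recommendation; setdefault leaves any existing setting untouched.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

try:
    import torch  # optional here so the helper also imports on CPU-only machines
except ImportError:
    torch = None


def free_cached_memory() -> bool:
    """Drop unreachable Python objects, then release PyTorch's unused cached blocks.

    Returns True if a CUDA cache was actually cleared.
    """
    gc.collect()
    if torch is None or not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True
```

Calling free_cached_memory() after each saved image is the pattern the fix describes; it cannot free tensors that are still referenced, so delete or overwrite large intermediate results first.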

Code: broken vs fixed

Broken - triggers the error

python

import torch
from diffusers import StableDiffusionXLPipeline

model_id = 'stabilityai/stable-diffusion-xl-base-1.0'
pipe = StableDiffusionXLPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16, use_safetensors=True
)
pipe = pipe.to('cuda')  # loads every component onto the GPU at once
# BROKEN: moving the whole pipeline to the GPU is what offloading is meant to
# avoid, and no other memory optimization is enabled, so a 1024x1024 SDXL run
# can still exhaust VRAM on smaller cards
pipe.enable_model_cpu_offload()

prompt = 'a cinematic shot of a desert landscape'
image = pipe(prompt, height=1024, width=1024).images[0]  # ← OutOfMemoryError here
image.save('output.png')

Fixed - works correctly

python

import torch
from diffusers import StableDiffusionXLPipeline

model_id = 'stabilityai/stable-diffusion-xl-base-1.0'
pipe = StableDiffusionXLPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16, use_safetensors=True
)
# FIXED: no pipe.to('cuda') here. enable_model_cpu_offload() moves each
# component to the GPU only while it runs.

# Attention slicing reduces peak memory per attention op;
# CPU offloading keeps idle components in CPU RAM.
pipe.enable_attention_slicing()
pipe.enable_model_cpu_offload()

prompt = 'a cinematic shot of a desert landscape'
image = pipe(prompt, height=1024, width=1024).images[0]  # ← Works: ~6-8GB VRAM
image.save('output.png')

# Verify offloading worked: little should remain allocated after the run
print(f'GPU memory after generation: {torch.cuda.memory_allocated() / 1e9:.2f} GB')
torch.cuda.empty_cache()  # Release cached blocks before the next run

Removed pipe.to('cuda') so the full pipeline is never resident on the GPU at once, added enable_attention_slicing() alongside enable_model_cpu_offload() to reduce memory per attention block, and called torch.cuda.empty_cache() after inference to release cached memory before subsequent runs.

⚠ Workaround

If enable_model_cpu_offload() still causes OOM, use pipe.enable_sequential_cpu_offload() instead. It moves one submodule at a time to the GPU and back, so peak VRAM is much lower, but inference is considerably slower. Alternatively, reduce the image resolution (e.g., to 512x512), which shrinks the latent tensors and activations; note that SDXL is trained for 1024x1024, so quality may drop. Lowering num_inference_steps from 50 to 20 shortens the run but does not reduce peak memory.
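The fallback described above could be automated roughly like this. It is a sketch under assumptions: load_pipeline is a hypothetical zero-argument factory that returns a freshly loaded pipeline (reloading avoids relying on how the two offload modes interact on one object), and OOM is detected from the RuntimeError message.

```python
OFFLOAD_METHODS = ("enable_model_cpu_offload", "enable_sequential_cpu_offload")


def generate_with_fallback(load_pipeline, prompt, **kwargs):
    """Try model offload first; on CUDA OOM, reload and retry with sequential offload.

    Returns (image, name_of_the_offload_method_that_worked).
    """
    last_message = ""
    for method in OFFLOAD_METHODS:
        pipe = load_pipeline()      # fresh pipeline for each attempt
        getattr(pipe, method)()     # enable the chosen offload mode
        try:
            return pipe(prompt, **kwargs).images[0], method
        except RuntimeError as exc:  # torch.cuda.OutOfMemoryError subclasses RuntimeError
            if "out of memory" not in str(exc).lower():
                raise
            last_message = str(exc)
        finally:
            del pipe                # drop the reference before the next attempt
    raise RuntimeError(f"out of memory with every offload mode tried: {last_message}")
```

In practice it may also help to call torch.cuda.empty_cache() between attempts so the second load starts from a clean cache.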

✓ Prevention

Set up memory management once, right after pipeline init: call `.enable_attention_slicing()` and then `.enable_model_cpu_offload()`, and do not call `.to('cuda')`. Monitor GPU memory before and after offloading with a utility function. In production, `torch.cuda.set_per_process_memory_fraction(0.9)` caps how much of the card this process may allocate; it does not pre-allocate memory or prevent fragmentation, but it leaves headroom for other processes and makes an over-budget run fail early. Test on the target GPU size (e.g., run inference on a 6 GB card in dev) before deploying.
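When no card of the target size is available, one approximation is to cap the allocator on a larger GPU so that runs exceeding the target budget fail in development. This is a sketch, not a substitute for testing on real hardware: fraction_for_budget and emulate_smaller_gpu are hypothetical names, and the cap covers only memory allocated through PyTorch's caching allocator.

```python
try:
    import torch  # optional here so the helpers also import on CPU-only machines
except ImportError:
    torch = None


def fraction_for_budget(budget_gb: float, total_gb: float) -> float:
    """Fraction of the card that corresponds to the target budget, capped at 1.0."""
    if budget_gb <= 0 or total_gb <= 0:
        raise ValueError("budget_gb and total_gb must be positive")
    return min(1.0, budget_gb / total_gb)


def emulate_smaller_gpu(budget_gb: float, device: int = 0):
    """Limit this process to roughly budget_gb of VRAM; return the fraction applied.

    Returns None when no CUDA device is available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    total_gb = torch.cuda.get_device_properties(device).total_memory / 1e9
    fraction = fraction_for_budget(budget_gb, total_gb)
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return fraction
```

For instance, emulate_smaller_gpu(6) on a 12 GB card would apply a fraction of about 0.5, after which an allocation beyond that share raises an out-of-memory error.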

Python 3.9+ · diffusers >=0.21.0 · tested on 0.27.x

Verified 2026-04 · stable-diffusion-xl-base-1.0, stable-diffusion-v1-5
