How-to · Intermediate · 3 min read

Stable Diffusion quantization guide

Quick answer
Quantize Stable Diffusion components with bitsandbytes by passing a BitsAndBytesConfig with load_in_4bit=True to from_pretrained. Stable Diffusion is a diffusion model, so the pipeline itself is loaded through diffusers rather than transformers' AutoModelForCausalLM; quantization mainly reduces VRAM usage so the model fits on smaller GPUs (it does not usually speed up inference).

PREREQUISITES

  • Python 3.8+
  • pip install diffusers transformers>=4.30.0 bitsandbytes accelerate torch
  • Access to a CUDA-enabled GPU for best performance

Setup

Install the required packages: diffusers for the Stable Diffusion pipeline, transformers for the CLIP text encoder, bitsandbytes for quantization support, accelerate for device placement, and torch for the PyTorch backend. bitsandbytes 4-bit and 8-bit quantization requires a CUDA-capable GPU.

bash
pip install diffusers transformers bitsandbytes accelerate torch
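Before downloading any model weights, it is worth confirming that the packages above are importable. A minimal check (this only verifies installation; it does not verify that a CUDA GPU is present, which bitsandbytes also needs):

```python
import importlib.util

# Confirm each required package is installed before loading any models
for pkg in ("torch", "diffusers", "transformers", "bitsandbytes"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING'}")
```

If torch is present, `torch.cuda.is_available()` is the usual follow-up check for a working GPU.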

Step by step

Load the Stable Diffusion pipeline with diffusers, quantizing the CLIP text encoder to 4-bit via BitsAndBytesConfig. (Stable Diffusion is not a causal language model, so it cannot be loaded with AutoModelForCausalLM; the text encoder is the component that is a standard transformers model.) This reduces VRAM usage while maintaining good quality. Then run a simple text-to-image generation.

python
from diffusers import StableDiffusionPipeline
from transformers import BitsAndBytesConfig, CLIPTextModel
import torch

model_name = "runwayml/stable-diffusion-v1-5"

# Configure 4-bit quantization for the CLIP text encoder
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load the text encoder in 4-bit from the pipeline's text_encoder subfolder
text_encoder = CLIPTextModel.from_pretrained(
    model_name,
    subfolder="text_encoder",
    quantization_config=quant_config
)

# Build the pipeline with the quantized encoder; keep the rest in fp16
pipe = StableDiffusionPipeline.from_pretrained(
    model_name,
    text_encoder=text_encoder,
    torch_dtype=torch.float16
)
pipe.to("cuda")  # 4-bit models can be moved between devices; 8-bit cannot

# Generate an image and save it to disk
prompt = "A fantasy landscape with mountains and rivers"
image = pipe(prompt).images[0]
image.save("fantasy_landscape.png")

Common variations

  • Use 8-bit quantization by setting load_in_8bit=True (instead of load_in_4bit=True) in BitsAndBytesConfig.
  • Run inference asynchronously (e.g. with asyncio) by dispatching the blocking pipeline call to a worker thread.
  • Use other Stable Diffusion checkpoints or custom fine-tuned models by changing model_name.
python
from transformers import BitsAndBytesConfig, CLIPTextModel

# 8-bit quantization config (the bnb_4bit_* options do not apply here)
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

text_encoder_8bit = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder="text_encoder",
    quantization_config=quant_config_8bit,
    device_map="auto"
)
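The asyncio variation above can be sketched generically. Here generate_image is a stand-in for the blocking pipe(prompt) call, not a diffusers API; note that a single-GPU pipeline is not parallel, so this only keeps the event loop responsive while each generation runs (asyncio.to_thread requires Python 3.9+):

```python
import asyncio

def generate_image(prompt: str) -> str:
    # Stand-in for the blocking diffusers call, e.g. pipe(prompt).images[0]
    return f"image:{prompt}"

async def generate_async(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(generate_image, prompt)

async def main() -> list:
    # gather preserves the order of the awaited results
    return await asyncio.gather(
        generate_async("a castle at dusk"),
        generate_async("a misty forest"),
    )

results = asyncio.run(main())
print(results)  # ['image:a castle at dusk', 'image:a misty forest']
```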

Troubleshooting

  • If you get CUDA out-of-memory errors, lower the image resolution, generate one image at a time, or enable diffusers memory savers such as pipe.enable_attention_slicing() or pipe.enable_model_cpu_offload(). Note that 8-bit uses more memory than 4-bit, not less.
  • Ensure bitsandbytes installed correctly and can see your GPU (run python -m bitsandbytes for a diagnostic).
  • bitsandbytes 4-bit/8-bit quantization does not run on CPU-only machines; a CUDA GPU is required.
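To see why bit width matters for memory errors, here is a back-of-envelope estimate of weight memory. The parameter count below is a rough figure for the SD 1.5 UNet, and activations (which quantization does not shrink) are excluded:

```python
def weight_memory_gib(n_params: float, bits: int) -> float:
    # Memory for the weights alone: params * bits / 8 bytes, in GiB
    return n_params * bits / 8 / 1024**3

unet_params = 860e6  # approximate SD 1.5 UNet parameter count
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit UNet weights: ~{weight_memory_gib(unet_params, bits):.2f} GiB")
```

This is why 4-bit roughly quarters the weight footprint relative to fp16, while 8-bit only halves it.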

Key Takeaways

  • Pass BitsAndBytesConfig with load_in_4bit=True when loading Stable Diffusion components to cut GPU memory usage; the pipeline itself is loaded through diffusers.
  • 4-bit quantization gives the biggest memory savings; 8-bit uses more memory but can track full precision more closely.
  • bitsandbytes quantization requires a CUDA GPU; avoid out-of-memory errors by lowering resolution, generating one image at a time, or enabling attention slicing or CPU offload.
Verified 2026-04 · runwayml/stable-diffusion-v1-5