Stable Diffusion quantization guide
Quick answer
Quantize Stable Diffusion models by passing a BitsAndBytesConfig with load_in_4bit=True when loading the memory-heavy denoiser, cutting its VRAM footprint to roughly a quarter of fp16. Note that Stable Diffusion checkpoints are diffusion pipelines, not language models: load them with the diffusers library (which ships its own BitsAndBytesConfig), not with transformers' AutoModelForCausalLM.
Prerequisites
- Python 3.8+
- pip install diffusers transformers accelerate bitsandbytes torch
- A CUDA-enabled GPU (the bitsandbytes quantization kernels target CUDA)
Setup
Install the required packages: diffusers (0.31 or later, for bitsandbytes quantization support) to load the Stable Diffusion pipeline, transformers for the text encoders, bitsandbytes for the quantization kernels, accelerate for device placement, and torch as the PyTorch backend. A CUDA GPU is needed for the 4-bit kernels.
pip install diffusers transformers accelerate bitsandbytes torch
Step by step
Load the transformer (the denoiser) of a Stable Diffusion 3.5 checkpoint with 4-bit NF4 quantization via diffusers' BitsAndBytesConfig, build the pipeline around it, and generate an image. This reduces VRAM usage significantly while maintaining good quality. diffusers documents bitsandbytes quantization for transformer-based denoisers such as SD3 and Flux; the UNet-based SD 1.5 is normally run in fp16 without bitsandbytes. The stabilityai checkpoint below is gated, so accept its license on the Hugging Face Hub first.
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline
import torch
# Configure 4-bit NF4 quantization with double quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
model_id = "stabilityai/stable-diffusion-3.5-medium"
# Quantize the transformer, the memory-heavy denoiser, at load time
transformer = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16
)
# Build the full pipeline around the quantized transformer
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keep idle components off the GPU
# Example prompt
prompt = "A fantasy landscape with mountains and rivers"
image = pipe(prompt).images[0]
image.save("fantasy_landscape.png")
Output
The pipeline returns a PIL image, saved here as fantasy_landscape.png. A diffusion model produces images, not text, so there is no tokenizer output to decode.
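To get intuition for the memory savings from 4-bit quantization, weight storage can be estimated from bits per parameter. A rough stdlib-only sketch follows; the 2-billion parameter count is an illustrative round number, not the size of any specific checkpoint:

```python
# Approximate weight memory at different precisions.
BITS_PER_PARAM = {"fp32": 32.0, "fp16": 16.0, "int8": 8.0, "nf4": 4.0}

def weight_gib(n_params: float, fmt: str) -> float:
    """Weight storage in GiB (ignores activations, KV/attention buffers, etc.)."""
    return n_params * BITS_PER_PARAM[fmt] / 8 / 2**30

n = 2e9  # hypothetical ~2B-parameter denoiser, for illustration only
for fmt in ("fp16", "int8", "nf4"):
    print(f"{fmt}: {weight_gib(n, fmt):.2f} GiB")
# → fp16: 3.73 GiB, int8: 1.86 GiB, nf4: 0.93 GiB

# Double quantization (bnb_4bit_use_double_quant=True) also quantizes the
# per-block scaling constants. With one fp32 scale per 64-weight block the
# overhead is 32/64 = 0.5 bits/param; storing 8-bit scales plus one fp32
# second-level constant per 256 blocks drops it to ~0.127 bits/param.
plain = 32 / 64
double = 8 / 64 + 32 / (64 * 256)
print(f"double-quant saving: {plain - double:.3f} bits/param")
# → double-quant saving: 0.373 bits/param
```

The ~0.37 bits-per-parameter saving matches the figure reported for double quantization in the QLoRA paper.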
Common variations
- Use 8-bit quantization by setting load_in_8bit=True instead of load_in_4bit=True in BitsAndBytesConfig (the two flags are mutually exclusive).
- Run inference asynchronously with asyncio if integrating into async pipelines.
- Use other quantization-compatible checkpoints, such as Flux or fine-tuned SD3 variants, by changing model_id.
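The asyncio variation above can be sketched with a stand-in for the blocking pipeline call. generate_image here is a hypothetical stub; a real pipe(prompt) call is GPU-bound and blocking, so it should be pushed onto a worker thread with asyncio.to_thread to keep the event loop responsive:

```python
import asyncio
import time

def generate_image(prompt: str) -> str:
    # Hypothetical stub simulating a blocking diffusion call like pipe(prompt)
    time.sleep(0.1)
    return f"image for: {prompt}"

async def generate_async(prompt: str) -> str:
    # to_thread moves the blocking call onto a worker thread so the
    # event loop stays free for other tasks (e.g. serving requests)
    return await asyncio.to_thread(generate_image, prompt)

async def main() -> None:
    results = await asyncio.gather(
        generate_async("a castle at dawn"),
        generate_async("a river through mountains"),
    )
    for r in results:
        print(r)

if __name__ == "__main__":
    asyncio.run(main())
```

Note that this does not parallelize GPU work itself; requests still queue on the device. It only prevents a single generation from blocking the rest of an async application.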
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
import torch
# 8-bit quantization config (the bnb_4bit_* options do not apply in 8-bit mode)
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    subfolder="transformer",
    quantization_config=quant_config_8bit,
    torch_dtype=torch.float16
)
Troubleshooting
- If you get CUDA out-of-memory errors, reduce batch size or switch to 8-bit quantization.
- Ensure bitsandbytes is installed correctly and that your GPU supports its 4-bit kernels.
- bitsandbytes quantization targets CUDA GPUs; on CPU-only machines, run the pipeline unquantized in fp16/fp32 instead.
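For the out-of-memory case, a common pattern is to retry with progressively smaller batches. A hypothetical, framework-agnostic sketch: run_batch stands in for a call like pipe(prompts), and MemoryError stands in for torch.cuda.OutOfMemoryError:

```python
def generate_with_backoff(prompts, run_batch, oom_error=MemoryError, min_batch=1):
    """Retry generation, halving the batch size on each out-of-memory error."""
    batch = len(prompts)
    while batch >= min_batch:
        try:
            out = []
            for i in range(0, len(prompts), batch):
                out.extend(run_batch(prompts[i:i + batch]))
            return out
        except oom_error:
            batch //= 2  # back off and retry with smaller batches
    raise RuntimeError("out of memory even at the minimum batch size")

# Demo with a fake backend that 'OOMs' on batches larger than 2
def fake_run_batch(batch):
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    return [p.upper() for p in batch]

print(generate_with_backoff(["a", "b", "c", "d"], fake_run_batch))
# → ['A', 'B', 'C', 'D']
```

With a real pipeline you would pass oom_error=torch.cuda.OutOfMemoryError and call torch.cuda.empty_cache() inside the except block before retrying.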
Key Takeaways
- Use BitsAndBytesConfig with load_in_4bit=True to quantize the Stable Diffusion denoiser for efficient GPU memory usage.
- 4-bit NF4 quantization offers a strong balance of memory savings and quality; 8-bit is an alternative when 4-bit degrades results.
- Always run quantized models on CUDA GPUs for best performance and avoid out-of-memory errors by adjusting batch sizes.