How-to · Intermediate · 3 min read

Stable Diffusion quantization guide

Quick answer
Quantize Stable Diffusion components with bitsandbytes by passing a BitsAndBytesConfig with load_in_4bit=True to from_pretrained. Stable Diffusion is a diffusion model, so the pipeline itself is loaded through diffusers rather than transformers' AutoModelForCausalLM; quantization mainly reduces VRAM usage so the model fits on smaller GPUs (it does not usually speed up inference).

PREREQUISITES

  • Python 3.8+
  • pip install diffusers transformers>=4.30.0 bitsandbytes accelerate torch
  • Access to a CUDA-enabled GPU for best performance

Setup

Install the required packages: diffusers for the Stable Diffusion pipeline, transformers for the CLIP text encoder, bitsandbytes for quantization support, accelerate for device placement, and torch for the PyTorch backend. bitsandbytes 4-bit and 8-bit quantization requires a CUDA-capable GPU.

bash
pip install diffusers transformers bitsandbytes accelerate torch
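Before downloading any model weights, it is worth confirming that the packages above are importable. A minimal check (this only verifies installation; it does not verify that a CUDA GPU is present, which bitsandbytes also needs):

```python
import importlib.util

# Confirm each required package is installed before loading any models
for pkg in ("torch", "diffusers", "transformers", "bitsandbytes"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING'}")
```

If torch is present, `torch.cuda.is_available()` is the usual follow-up check for a working GPU.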

Step by step

Load the Stable Diffusion pipeline with diffusers, quantizing the CLIP text encoder to 4-bit via BitsAndBytesConfig. (Stable Diffusion is not a causal language model, so it cannot be loaded with AutoModelForCausalLM; the text encoder is the component that is a standard transformers model.) This reduces VRAM usage while maintaining good quality. Then run a simple text-to-image generation.

python
from diffusers import StableDiffusionPipeline
from transformers import BitsAndBytesConfig, CLIPTextModel
import torch

model_name = "runwayml/stable-diffusion-v1-5"

# Configure 4-bit quantization for the CLIP text encoder
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load the text encoder in 4-bit from the pipeline's text_encoder subfolder
text_encoder = CLIPTextModel.from_pretrained(
    model_name,
    subfolder="text_encoder",
    quantization_config=quant_config
)

# Build the pipeline with the quantized encoder; keep the rest in fp16
pipe = StableDiffusionPipeline.from_pretrained(
    model_name,
    text_encoder=text_encoder,
    torch_dtype=torch.float16
)
pipe.to("cuda")  # 4-bit models can be moved between devices; 8-bit cannot

# Generate an image and save it to disk
prompt = "A fantasy landscape with mountains and rivers"
image = pipe(prompt).images[0]
image.save("fantasy_landscape.png")

Common variations

  • Use 8-bit quantization by setting load_in_8bit=True (instead of load_in_4bit=True) in BitsAndBytesConfig.
  • Run inference asynchronously (e.g. with asyncio) by dispatching the blocking pipeline call to a worker thread.
  • Use other Stable Diffusion checkpoints or custom fine-tuned models by changing model_name.
python
from transformers import BitsAndBytesConfig, CLIPTextModel

# 8-bit quantization config (the bnb_4bit_* options do not apply here)
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

text_encoder_8bit = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder="text_encoder",
    quantization_config=quant_config_8bit,
    device_map="auto"
)
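The asyncio variation above can be sketched generically. Here generate_image is a stand-in for the blocking pipe(prompt) call, not a diffusers API; note that a single-GPU pipeline is not parallel, so this only keeps the event loop responsive while each generation runs (asyncio.to_thread requires Python 3.9+):

```python
import asyncio

def generate_image(prompt: str) -> str:
    # Stand-in for the blocking diffusers call, e.g. pipe(prompt).images[0]
    return f"image:{prompt}"

async def generate_async(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(generate_image, prompt)

async def main() -> list:
    # gather preserves the order of the awaited results
    return await asyncio.gather(
        generate_async("a castle at dusk"),
        generate_async("a misty forest"),
    )

results = asyncio.run(main())
print(results)  # ['image:a castle at dusk', 'image:a misty forest']
```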

Troubleshooting

  • If you get CUDA out-of-memory errors, lower the image resolution, generate one image at a time, or enable diffusers memory savers such as pipe.enable_attention_slicing() or pipe.enable_model_cpu_offload(). Note that 8-bit uses more memory than 4-bit, not less.
  • Ensure bitsandbytes installed correctly and can see your GPU (run python -m bitsandbytes for a diagnostic).
  • bitsandbytes 4-bit/8-bit quantization does not run on CPU-only machines; a CUDA GPU is required.
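To see why bit width matters for memory errors, here is a back-of-envelope estimate of weight memory. The parameter count below is a rough figure for the SD 1.5 UNet, and activations (which quantization does not shrink) are excluded:

```python
def weight_memory_gib(n_params: float, bits: int) -> float:
    # Memory for the weights alone: params * bits / 8 bytes, in GiB
    return n_params * bits / 8 / 1024**3

unet_params = 860e6  # approximate SD 1.5 UNet parameter count
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit UNet weights: ~{weight_memory_gib(unet_params, bits):.2f} GiB")
```

This is why 4-bit roughly quarters the weight footprint relative to fp16, while 8-bit only halves it.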

Key Takeaways

  • Pass BitsAndBytesConfig with load_in_4bit=True when loading Stable Diffusion components to cut GPU memory usage; the pipeline itself is loaded through diffusers.
  • 4-bit quantization gives the biggest memory savings; 8-bit uses more memory but can track full precision more closely.
  • bitsandbytes quantization requires a CUDA GPU; avoid out-of-memory errors by lowering resolution, generating one image at a time, or enabling attention slicing or CPU offload.
Verified 2026-04 · runwayml/stable-diffusion-v1-5