How to use GPTQ quantized models
Quick answer
Load GPTQ quantized checkpoints directly with Hugging Face Transformers: with the optimum and auto-gptq backends installed, from_pretrained reads the quantization settings stored in the checkpoint and loads the low-bit weights automatically. This sharply reduces memory use while maintaining accuracy, enabling efficient deployment of large language models.
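The memory savings are easy to estimate. Here is a back-of-envelope sketch (the 7B parameter count is illustrative, not tied to any specific checkpoint; real GPTQ checkpoints also store per-group scales and zero points, so actual footprints are slightly larger):

```python
# Rough weight-memory math for a hypothetical 7B-parameter model.
params = 7_000_000_000

fp16_gb = params * 2 / 1024**3    # 16-bit floats: 2 bytes per weight
int4_gb = params * 0.5 / 1024**3  # 4-bit ints: 0.5 bytes per weight

print(f"fp16 weights:  {fp16_gb:.1f} GiB")  # ~13.0 GiB
print(f"4-bit weights: {int4_gb:.1f} GiB")  # ~3.3 GiB
```

A 4x reduction in weight memory is what lets a 7B model fit comfortably on a single consumer GPU.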
PREREQUISITES
Python 3.8+
pip install transformers>=4.32.0
pip install optimum auto-gptq
pip install torch
Access to a GPTQ quantized model checkpoint
Setup
Install the necessary Python packages to load and run GPTQ quantized models. Use transformers for model loading, optimum and auto-gptq for the GPTQ kernels, and torch for tensor operations.
pip install transformers optimum auto-gptq torch
Step by step
Load a GPTQ quantized model with AutoModelForCausalLM. The quantization settings travel with the checkpoint's config, so Transformers picks them up automatically as long as optimum and auto-gptq are installed. This example shows how to load a GPTQ quantized LLaMA model and run a simple text generation.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load tokenizer and quantized model; the GPTQ settings are
# read from the checkpoint's quantization config
model_name = "path_or_hub_to_gptq_quantized_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16
)
# Prepare input prompt on the model's device
inputs = tokenizer("Hello, GPTQ quantized model!", return_tensors="pt").to(model.device)
# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output
Hello, GPTQ quantized model! Here is a demonstration of efficient inference using 4-bit quantization.
Common variations
- Quantize a full-precision model yourself by passing a GPTQConfig (e.g. bits=4 or bits=8) as quantization_config to from_pretrained; this requires a GPU and a calibration dataset.
- GPTQ inference kernels primarily target CUDA GPUs; running on CPU is only partially supported and much slower.
- Use different GPTQ quantized models by changing model_name to the appropriate checkpoint.
- Combine with LoRA or other parameter-efficient fine-tuning methods on quantized models for customization.
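If you want to produce your own GPTQ checkpoint rather than download one, Transformers exposes a GPTQConfig for this. The following is a sketch only: the actual quantization run needs a GPU, a calibration dataset, and the auto-gptq backend, and the model id shown is a placeholder.

```python
from transformers import GPTQConfig

# 4-bit GPTQ settings; "c4" is one of the calibration-dataset
# presets Transformers accepts (GPTQ uses calibration data to
# minimize per-layer quantization error), and group_size=128
# is the common granularity for the quantization scales.
gptq_config = GPTQConfig(bits=4, dataset="c4", group_size=128)

# Quantize by passing the config to from_pretrained, then save:
# model = AutoModelForCausalLM.from_pretrained(
#     "your/full-precision-model",  # placeholder model id
#     quantization_config=gptq_config,
#     device_map="auto",
# )
# model.save_pretrained("model-gptq-4bit")
```

Quantization runs layer by layer and can take a while for large models, but it only has to be done once; the saved checkpoint loads like any other GPTQ model.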
Troubleshooting
- If you get RuntimeError: CUDA out of memory, reduce the batch size or max_new_tokens, or try a smaller quantized checkpoint.
- If the model fails to load, verify the checkpoint is actually GPTQ quantized and that optimum and auto-gptq are installed.
- Ensure your PyTorch, transformers, optimum, and auto-gptq versions are compatible; upgrade if needed.
- For tokenization errors, confirm the tokenizer matches the model architecture.
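A quick way to see which of these packages are actually installed, and at which versions, using only the standard library (installed_versions is a hypothetical helper name, not part of any library):

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions

for pkg, ver in installed_versions(("torch", "transformers", "optimum", "auto-gptq")).items():
    print(f"{pkg}: {ver or 'not installed'}")
```

Include this output when reporting loading problems; most GPTQ issues come down to a missing or mismatched backend package.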
Key Takeaways
- GPTQ quantized models enable efficient inference by reducing model size with minimal accuracy loss.
- With optimum and auto-gptq installed, Hugging Face Transformers loads GPTQ checkpoints directly; the quantization settings ship with the model config.
- Always match tokenizer and model checkpoints to avoid compatibility issues.
- Adjust device mapping and quantization settings based on your hardware capabilities.
- Keep dependencies like PyTorch, optimum, and auto-gptq up to date for smooth operation.