How-to · Intermediate · 3 min read

How to use GPTQ quantized models

Quick answer
Load GPTQ quantized checkpoints with Hugging Face Transformers plus the optimum and auto-gptq backends. A GPTQ checkpoint stores its quantization settings (typically 4-bit weights) inside the checkpoint itself, so from_pretrained picks them up automatically — no quantization config object is needed at load time. (BitsAndBytesConfig belongs to the separate bitsandbytes quantization method and is not used for GPTQ checkpoints.) The result is sharply reduced memory use and faster inference with minimal accuracy loss, enabling large language models to run on modest GPUs.
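To see the memory side of that claim, here is a back-of-the-envelope estimate covering weights only (activations and KV cache excluded); the 7B parameter count is an illustrative assumption, not tied to any specific checkpoint:

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes per byte / 1024^3."""
    return n_params * bits_per_weight / 8 / 1024**3

params = 7e9  # e.g. a 7B-parameter model
print(f"fp16:       {weight_memory_gib(params, 16):.1f} GiB")  # 13.0 GiB
print(f"GPTQ 4-bit: {weight_memory_gib(params, 4):.1f} GiB")   # 3.3 GiB
```

Roughly a 4x reduction in weight memory — the difference between needing a data-center GPU and fitting on a consumer card.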

PREREQUISITES

  • Python 3.8+
  • pip install transformers>=4.32.0
  • pip install optimum
  • pip install auto-gptq
  • pip install torch
  • A CUDA-capable GPU (the GPTQ kernels are GPU-oriented)
  • Access to a GPTQ quantized model checkpoint

Setup

Install the packages needed to load and run GPTQ quantized models: transformers for model loading, optimum and auto-gptq for the GPTQ dequantization kernels, and torch for tensor operations.

bash
pip install transformers optimum auto-gptq torch

Step by step

Load a GPTQ quantized model directly with from_pretrained. A GPTQ checkpoint embeds its quantization settings in config.json, so no quantization config object is passed at load time; Transformers detects the settings and dispatches to the auto-gptq kernels. This example loads a GPTQ quantized LLaMA model and runs a simple text generation.

python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and quantized model; the GPTQ settings are read
# from the checkpoint's own config, so no quantization_config is needed.
model_name = "path_or_hub_to_gptq_quantized_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"  # place layers on the available GPU(s)
)

# Prepare the input prompt on the model's device
inputs = tokenizer("Hello, GPTQ quantized model!", return_tensors="pt").to(model.device)

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Hello, GPTQ quantized model! Here is a demonstration of efficient inference using 4-bit quantization.

Common variations

  • GPTQ bit-width and group size are fixed when a checkpoint is created; to use 8-bit (or 3-bit) weights, download a checkpoint quantized at that width rather than toggling a load-time flag.
  • Quantize your own full-precision model by passing transformers' GPTQConfig (e.g. GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)) to from_pretrained, then save the result for later loading.
  • The GPTQ kernels target CUDA GPUs; CPU-only inference is generally unsupported or impractically slow.
  • Use different GPTQ quantized models by changing model_name to the appropriate checkpoint.
  • Combine with LoRA adapters via the peft library to fine-tune on top of a frozen GPTQ base model.
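GPTQ checkpoints also differ in group size: each group of weights shares a scale and zero point, so smaller groups improve accuracy but cost more storage. A rough per-weight estimate, assuming a 16-bit scale and 16-bit zero point per group (exact packing varies by implementation, so treat this as an approximation):

```python
def effective_bits(bits: int, group_size: int,
                   scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Quantized weight width plus amortized per-group scale/zero-point overhead."""
    return bits + (scale_bits + zero_bits) / group_size

print(f"4-bit, group size 128: {effective_bits(4, 128):.2f} bits/weight")  # 4.25
print(f"4-bit, group size 32:  {effective_bits(4, 32):.2f} bits/weight")   # 5.00
```

Group size 128 is a common default; drop to 32 only when the accuracy gain is worth the extra memory.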

Troubleshooting

  • If you get RuntimeError: CUDA out of memory, shorten the prompt, reduce max_new_tokens or batch size, or switch to a checkpoint of a smaller model.
  • If the model fails to load, verify the checkpoint is actually GPTQ quantized (its config.json should contain a quantization_config block) and that optimum and auto-gptq are installed.
  • Ensure your torch, transformers, optimum, and auto-gptq versions are mutually compatible; upgrade them together if needed.
  • For tokenization errors, confirm the tokenizer was loaded from the same checkpoint as the model.
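To check whether a checkpoint is GPTQ quantized programmatically, inspect its config.json for a quantization_config block. The sketch below uses an inline example config instead of downloading a real one; the field names follow how Transformers serializes GPTQ settings:

```python
import json

# Inline stand-in for a checkpoint's config.json
example_config = json.loads("""
{
  "model_type": "llama",
  "quantization_config": {"quant_method": "gptq", "bits": 4, "group_size": 128}
}
""")

def is_gptq(config: dict) -> bool:
    """True if the config advertises GPTQ quantization."""
    return (config.get("quantization_config") or {}).get("quant_method") == "gptq"

print(is_gptq(example_config))           # True
print(is_gptq({"model_type": "llama"}))  # False
```

On a real checkpoint you would read config.json from the model directory (or the Hub) and run the same check before loading.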

Key Takeaways

  • GPTQ applies post-training weight quantization once, producing checkpoints that run in a fraction of the full-precision memory with minimal accuracy loss.
  • Load GPTQ checkpoints with Hugging Face Transformers plus optimum and auto-gptq; the quantization settings travel with the checkpoint, so no extra config is needed at load time.
  • Always load the tokenizer and model from the same checkpoint to avoid compatibility issues.
  • Adjust device_map and generation settings to match your hardware capabilities.
  • Keep torch, transformers, optimum, and auto-gptq up to date for smooth operation.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct, GPTQ quantized LLaMA