
How to quantize an LLM with bitsandbytes

Quick answer
Use the BitsAndBytesConfig class from transformers to configure 4-bit or 8-bit quantization, then pass it to AutoModelForCausalLM.from_pretrained via the quantization_config argument. This shrinks the model's weight memory to roughly a quarter (4-bit) or half (8-bit) of its float16 size with minimal accuracy loss.

PREREQUISITES

  • Python 3.8+
  • A CUDA-capable NVIDIA GPU (bitsandbytes runs its quantization kernels on GPU)
  • pip install transformers bitsandbytes torch
  • Basic knowledge of Hugging Face Transformers

Setup

Install the required packages: transformers for model loading, bitsandbytes for the quantization kernels, and torch for the PyTorch backend.

bash
pip install transformers bitsandbytes torch

Step by step

Use BitsAndBytesConfig to specify 4-bit quantization and load the model with this config. This example loads meta-llama/Llama-3.1-8B-Instruct in 4-bit mode on the GPU. Note that this model is gated on the Hugging Face Hub, so accept its license on the model page and authenticate (e.g. with huggingface-cli login) before downloading.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NF4 usually preserves accuracy better than the default FP4
    bnb_4bit_compute_dtype=torch.float16   # run matmuls in float16
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"  # let Accelerate place layers on available GPUs (and CPU if needed)
)

prompt = "Explain quantization in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Explain quantization in simple terms. Quantization is a technique to reduce the size of a model by using fewer bits to represent numbers, which speeds up inference and lowers memory usage.

Common variations

  • Use load_in_8bit=True in BitsAndBytesConfig for 8-bit quantization.
  • Combine with LoRA adapters for fine-tuning quantized models.
  • Use device_map="auto" to automatically place model layers on available GPUs.
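
The 8-bit variation above is essentially a one-flag change to the config. A minimal sketch (the model name is carried over from the main example and can be swapped for any causal LM):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization: roughly half the memory of float16,
# typically slightly more accurate than 4-bit, but larger.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
```

The rest of the workflow (tokenizer, generate, decode) is unchanged.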

Troubleshooting

  • If you get CUDA out-of-memory errors, lower the batch size or max_new_tokens, or switch from 8-bit to 4-bit quantization (4-bit uses roughly half the memory of 8-bit).
  • Ensure your GPU supports float16 compute for best performance; on Ampere or newer GPUs, bfloat16 (bnb_4bit_compute_dtype=torch.bfloat16) is often more numerically stable.
  • Check that bitsandbytes is installed correctly and compatible with your PyTorch version.
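
Before chasing out-of-memory errors, a rough back-of-envelope estimate of weight memory tells you whether the model can fit at all. This counts weights only (it ignores activations, the KV cache, and quantization overhead); the 8B parameter count matches the example model:

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB: each parameter takes bits/8 bytes."""
    return num_params * bits_per_param / 8 / (1024 ** 3)

params = 8e9  # ~8B parameters, as in Llama-3.1-8B-Instruct

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gib(params, bits):.1f} GiB")
# float16: ~14.9 GiB, 8-bit: ~7.5 GiB, 4-bit: ~3.7 GiB
```

On a 16 GB GPU, the float16 weights alone barely fit, while the 4-bit weights leave ample headroom for activations and the KV cache.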

Key Takeaways

  • Use BitsAndBytesConfig with load_in_4bit=True to quantize LLMs efficiently.
  • Quantization sharply reduces memory usage, usually with minimal accuracy loss; raw inference speed may or may not improve.
  • Use device_map="auto" to let Accelerate place model layers across available GPUs automatically.
  • Prefer 8-bit quantization if 4-bit hurts output quality, and 4-bit if 8-bit runs out of memory.
  • Verify bitsandbytes compatibility with your environment to avoid runtime errors.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct