
How to quantize an LLM with bitsandbytes

Quick answer
Use the BitsAndBytesConfig class from transformers to configure 4-bit or 8-bit quantization, then pass it to AutoModelForCausalLM.from_pretrained via the quantization_config argument. This shrinks the model's weight memory to roughly a quarter (4-bit) or half (8-bit) of its float16 size with minimal accuracy loss.

PREREQUISITES

  • Python 3.8+
  • A CUDA-capable NVIDIA GPU (bitsandbytes runs its quantization kernels on GPU)
  • pip install transformers bitsandbytes torch
  • Basic knowledge of Hugging Face Transformers

Setup

Install the required packages: transformers for model loading, bitsandbytes for the quantization kernels, and torch for the PyTorch backend.

bash
pip install transformers bitsandbytes torch

Step by step

Use BitsAndBytesConfig to specify 4-bit quantization and load the model with this config. This example loads meta-llama/Llama-3.1-8B-Instruct in 4-bit mode on the GPU. Note that this model is gated on the Hugging Face Hub, so accept its license on the model page and authenticate (e.g. with huggingface-cli login) before downloading.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NF4 usually preserves accuracy better than the default FP4
    bnb_4bit_compute_dtype=torch.float16   # run matmuls in float16
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"  # let Accelerate place layers on available GPUs (and CPU if needed)
)

prompt = "Explain quantization in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Explain quantization in simple terms. Quantization is a technique to reduce the size of a model by using fewer bits to represent numbers, which speeds up inference and lowers memory usage.

Common variations

  • Use load_in_8bit=True in BitsAndBytesConfig for 8-bit quantization.
  • Combine with LoRA adapters for fine-tuning quantized models.
  • Use device_map="auto" to automatically place model layers on available GPUs.
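
The 8-bit variation above is essentially a one-flag change to the config. A minimal sketch (the model name is carried over from the main example and can be swapped for any causal LM):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization: roughly half the memory of float16,
# typically slightly more accurate than 4-bit, but larger.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
```

The rest of the workflow (tokenizer, generate, decode) is unchanged.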

Troubleshooting

  • If you get CUDA out-of-memory errors, lower the batch size or max_new_tokens, or switch from 8-bit to 4-bit quantization (4-bit uses roughly half the memory of 8-bit).
  • Ensure your GPU supports float16 compute for best performance; on Ampere or newer GPUs, bfloat16 (bnb_4bit_compute_dtype=torch.bfloat16) is often more numerically stable.
  • Check that bitsandbytes is installed correctly and compatible with your PyTorch version.
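
Before chasing out-of-memory errors, a rough back-of-envelope estimate of weight memory tells you whether the model can fit at all. This counts weights only (it ignores activations, the KV cache, and quantization overhead); the 8B parameter count matches the example model:

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB: each parameter takes bits/8 bytes."""
    return num_params * bits_per_param / 8 / (1024 ** 3)

params = 8e9  # ~8B parameters, as in Llama-3.1-8B-Instruct

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gib(params, bits):.1f} GiB")
# float16: ~14.9 GiB, 8-bit: ~7.5 GiB, 4-bit: ~3.7 GiB
```

On a 16 GB GPU, the float16 weights alone barely fit, while the 4-bit weights leave ample headroom for activations and the KV cache.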

Key Takeaways

  • Use BitsAndBytesConfig with load_in_4bit=True to quantize LLMs efficiently.
  • Quantization sharply reduces memory usage, usually with minimal accuracy loss; raw inference speed may or may not improve.
  • Use device_map="auto" to let Accelerate place model layers across available GPUs automatically.
  • Prefer 8-bit quantization if 4-bit hurts output quality, and 4-bit if 8-bit runs out of memory.
  • Verify bitsandbytes compatibility with your environment to avoid runtime errors.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct