How-to · Intermediate · 3 min read

How to load 4-bit model with BitsAndBytes

Quick answer
Use the BitsAndBytesConfig class from transformers to specify 4-bit quantization parameters, then pass it as quantization_config to AutoModelForCausalLM.from_pretrained(). This cuts the GPU memory needed for the weights roughly fourfold compared with float16 while maintaining good accuracy.

PREREQUISITES

  • Python 3.8+
  • pip install "transformers>=4.30.0" (quotes keep the shell from treating >= as a redirect)
  • pip install bitsandbytes
  • PyTorch installed with CUDA support (for GPU acceleration)

Setup

Install the required packages transformers and bitsandbytes to enable 4-bit quantization support. Ensure you have a compatible GPU and PyTorch with CUDA.

bash
pip install transformers bitsandbytes
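Before loading anything large, it can help to confirm the required packages are importable. This small helper is an illustration (not part of transformers or bitsandbytes); it only checks that the modules can be found, not that your CUDA setup works.

```python
import importlib.util

def check_quantization_env():
    """Report whether the packages needed for 4-bit loading are importable."""
    status = {}
    for pkg in ("transformers", "bitsandbytes", "torch"):
        status[pkg] = importlib.util.find_spec(pkg) is not None
    return status

print(check_quantization_env())
```

If any entry is False, revisit the pip install step above before continuing.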

Step by step

Load a 4-bit quantized model using BitsAndBytesConfig to configure quantization and pass it to from_pretrained(). This example loads a causal language model with 4-bit weights on GPU.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Use float16 for compute
    bnb_4bit_use_double_quant=True,       # Quantize the quantization constants too, saving extra memory
    bnb_4bit_quant_type="nf4"             # Quantization type (nf4 or fp4)
)

# Load tokenizer (meta-llama repos are gated: request access on the Hub and log in first)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)

# Test inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output (example; generated text will vary by model version and sampling settings)
Hello, how are you? I am a large language model trained by Meta.
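To sanity-check the savings, you can estimate weight-only memory from the parameter count. This back-of-the-envelope helper is an illustration, not part of the transformers API (which offers model.get_memory_footprint() on a loaded model); it ignores activations, the KV cache, and quantization overhead such as double-quant constants.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float = 4.0) -> float:
    """Rough weight-only memory estimate in GiB (ignores activations and overhead)."""
    return n_params * bits_per_weight / 8 / 1024**3

# An 8B-parameter model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{estimate_weight_memory_gb(8e9, bits):.1f} GiB")
# 16-bit: ~14.9 GiB
# 8-bit: ~7.5 GiB
# 4-bit: ~3.7 GiB
```

This is why an 8B model that will not fit on a 16 GB card in float16 loads comfortably in 4-bit.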

Common variations

  • Use bnb_4bit_compute_dtype=torch.float32 for higher precision at some speed cost.
  • Change bnb_4bit_quant_type to "fp4" for different quantization schemes.
  • For CPU-only, 4-bit quantization is not supported; use 8-bit or full precision.
  • Use device_map="auto" to automatically place layers on GPU(s).
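As a sketch of the fp4 variation above, here are the keyword arguments collected in a plain dict, so they can be reviewed without importing anything (pass them as BitsAndBytesConfig(**fp4_kwargs)). The string dtype name is accepted by BitsAndBytesConfig; nf4 generally remains the better default for transformer weights.

```python
# Alternate 4-bit settings for BitsAndBytesConfig (usage: BitsAndBytesConfig(**fp4_kwargs)).
fp4_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "fp4",           # "fp4" instead of the default-recommended "nf4"
    "bnb_4bit_compute_dtype": "bfloat16",   # string form is accepted; use on Ampere+ GPUs
    "bnb_4bit_use_double_quant": True,
}
print(fp4_kwargs)
```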

Troubleshooting

  • If you get RuntimeError: CUDA out of memory, reduce batch size or use a smaller model.
  • Ensure bitsandbytes is installed correctly and compatible with your CUDA version.
  • If BitsAndBytesConfig or its 4-bit options are not recognized, upgrade transformers (4-bit support requires 4.30 or newer).
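When debugging version mismatches like the ones above, a quick report of installed versions is often the fastest first step. This helper is an illustration using only the standard library:

```python
from importlib import metadata

def report_versions(packages=("transformers", "bitsandbytes", "torch")):
    """Return installed package versions to help debug compatibility issues."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

print(report_versions())
```

Include this output when filing bug reports against transformers or bitsandbytes.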

Key Takeaways

  • Use BitsAndBytesConfig with load_in_4bit=True to enable 4-bit quantization when loading models.
  • 4-bit quantization cuts GPU memory usage substantially with minimal accuracy loss; inference speed depends on your hardware and kernel support.
  • Always match bnb_4bit_compute_dtype and bnb_4bit_quant_type to your precision and performance needs.
  • Ensure your environment has compatible CUDA, PyTorch, and bitsandbytes versions installed.
  • Use device_map="auto" to automatically distribute model layers across GPUs.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct