How to load a model in 4-bit quantization with Hugging Face
Quick answer
Use the transformers library with bitsandbytes to load Hugging Face models in 4-bit quantization by setting load_in_4bit=True in from_pretrained. This cuts weight memory to roughly a quarter of fp16 with only a small accuracy trade-off.

Prerequisites
- Python 3.8+
- pip install transformers>=4.30.0
- pip install bitsandbytes
- pip install accelerate
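To see why 4-bit loading matters, here is a back-of-the-envelope estimate of weight memory at different precisions (a sketch: the 1.3e9 parameter count is an approximation for facebook/opt-1.3b, and real usage adds activations, the KV cache, and quantization overhead on top):

```python
# Rough memory-footprint estimate for model weights at different precisions.
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB (weights only, no overhead)."""
    return num_params * bits_per_param / 8 / 1024**3

params = 1.3e9  # approximate size of facebook/opt-1.3b
for bits, label in [(16, "fp16"), (8, "int8"), (4, "4-bit")]:
    print(f"{label:>5}: ~{weight_memory_gib(params, bits):.2f} GiB")
```

Halving the bit width halves the weight footprint, so 4-bit weights need about a quarter of the fp16 memory.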
Setup
Install the required libraries transformers, bitsandbytes, and accelerate to enable 4-bit quantization support.
pip install transformers bitsandbytes accelerate

Step by step
Load a Hugging Face model with 4-bit quantization using transformers and bitsandbytes. This example loads the facebook/opt-1.3b model in 4-bit mode for efficient inference.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "facebook/opt-1.3b"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_4bit=True,
device_map="auto"
)
# Encode input and generate output
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output
Hello, how are you? I am doing well, thank you for asking.
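The bare load_in_4bit=True flag works, but newer transformers releases route these options through a BitsAndBytesConfig object, which also exposes the 4-bit tuning knobs (quantization type, compute dtype, double quantization). A sketch of the equivalent call; this is a configuration fragment that assumes a CUDA-capable GPU and downloads the model weights:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Equivalent 4-bit setup expressed as an explicit quantization config.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the common default
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls at runtime
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Passing the config explicitly makes the quantization settings visible in one place and lets you change the compute dtype or quantization type without touching the loading call.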
Common variations
- Use load_in_8bit=True for 8-bit quantization instead of 4-bit.
- Specify device_map manually for multi-GPU setups.
- Use the transformers pipeline with 4-bit models by passing model_kwargs={"load_in_4bit": True}.
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="facebook/opt-1.3b",
model_kwargs={"load_in_4bit": True, "device_map": "auto"}
)
print(pipe("Hello, world!", max_new_tokens=20)[0]['generated_text'])

Output
Hello, world! I hope you are having a great day.
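For the multi-GPU variation above, device_map can also be an explicit dict mapping module names to device indices. A hypothetical split of facebook/opt-1.3b's 24 decoder layers across two GPUs; the module names follow the OPT architecture, but verify them against model.named_modules() before relying on this layout:

```python
# Hypothetical manual device_map: first half of the decoder on GPU 0,
# second half plus the output head on GPU 1. Module names are specific
# to the OPT architecture and will differ for other model families.
device_map = {
    "model.decoder.embed_tokens": 0,
    "model.decoder.embed_positions": 0,
    **{f"model.decoder.layers.{i}": (0 if i < 12 else 1) for i in range(24)},
    "model.decoder.final_layer_norm": 1,
    "lm_head": 1,
}
# Used as: AutoModelForCausalLM.from_pretrained(..., device_map=device_map)
```

A manual map like this is useful when 'auto' places layers unevenly or when one GPU must also hold other workloads.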
Troubleshooting
- If you see bitsandbytes import errors, ensure it is installed and your CUDA version is compatible.
- For device_map errors, try setting device_map='auto' or manually assign devices.
- If memory errors occur, verify your GPU has enough VRAM, or try 8-bit quantization.
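The first troubleshooting step can be automated with a small dependency probe (a sketch: it only checks that each package is importable, not that its CUDA build actually works on your hardware):

```python
# Minimal pre-flight check before loading a quantized model.
import importlib.util

def has_module(name: str) -> bool:
    """Return True if a module can be located without fully importing it."""
    return importlib.util.find_spec(name) is not None

for mod in ("transformers", "bitsandbytes", "accelerate", "torch"):
    status = "found" if has_module(mod) else "MISSING -> pip install " + mod
    print(f"{mod:>13}: {status}")
```

Using find_spec avoids triggering each library's import-time CUDA initialization, so the probe stays fast and cannot itself crash on a broken GPU setup.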
Key Takeaways
- Use load_in_4bit=True in from_pretrained to enable 4-bit quantized loading.
- Install bitsandbytes and accelerate to support quantized model loading.
- Set device_map='auto' to automatically place model layers on available GPUs.
- 4-bit quantization cuts weight memory to roughly a quarter of fp16 with minimal accuracy loss.
- Troubleshoot by verifying CUDA compatibility and GPU memory availability.