Code Intermediate medium · 6 min

load_in_4bit: extreme compression

What you will learn
Compress large language models to 4-bit precision using BitsAndBytesConfig, cutting memory usage by ~75% while maintaining usable inference quality.

Why this matters

Modern LLMs like Llama 2 70B or Mixtral require 140GB+ in full precision: impossible on consumer GPUs. 4-bit quantization lets you run state-of-the-art models on a single 24GB GPU (RTX 4090) or even smaller hardware, making fine-tuning and inference economically viable.

Skip if: Do not use 4-bit loading for: (1) tasks requiring maximum accuracy (medical diagnosis, financial calculations), (2) models already quantized by the publisher (they've tested and provided GGUF/AWQ versions), (3) small models <7B parameters where memory isn't the bottleneck, (4) training workflows requiring gradient computation (4-bit doesn't support backprop without additional techniques like QLoRA).

Explanation

What it is: 4-bit quantization represents model weights using only 4 bits (0-15 range) instead of 32 bits (float32), reducing memory footprint by 87.5%. The Hugging Face BitsAndBytesConfig paired with load_in_4bit=True handles this compression automatically during model loading using the bitsandbytes library.

How it works mechanically: When you set load_in_4bit=True, Hugging Face intercepts the model loading process. Before weights are copied to GPU memory, bitsandbytes quantizes them: it scales each weight tensor to fit in 4 bits (storing an additional scale factor and offset). During inference, weights are dequantized on-the-fly in GPU memory: this is asymmetric quantization (weights compressed, computation in higher precision). The bnb_4bit_quant_type parameter controls whether quantization is symmetric (all weights scaled identically) or NF4 (normalized float 4, optimized for normal distributions of weights).

When to use it: Use 4-bit loading when: (1) you want to run large models on limited hardware, (2) inference latency is acceptable (dequantization adds ~5-10% overhead), (3) you're using inference-only pipelines or QLoRA fine-tuning (which uses LoRA adapters on top of frozen 4-bit weights).

Analogy

Imagine a high-resolution image (float32 weights) stored in full color. 4-bit quantization is like converting it to a 16-color palette with a separate brightness map: the palette is tiny, but you restore colors from the map during viewing. You lose some detail, but the file is 1/8th the size and displays instantly on old hardware.

Code

python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quant_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("What is 2+2?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)
print(f"\nModel dtype: {model.dtype}")
print(f"Model device: {model.device}")
print(f"Estimated memory: ~7B * 0.5 bytes (4-bit + overhead) ≈ 3.5GB")
Output
What is 2+2? The answer is 4. It is a simple arithmetic operation.

Model dtype: torch.float16
Model device: cuda:0
Estimated memory: ~7B * 0.5 bytes (4-bit + overhead) ≈ 3.5GB

What just happened?

We created a BitsAndBytesConfig specifying 4-bit quantization with NF4 quantization type and double quantization (quantize the scale factors themselves). We loaded a 7B parameter model using `from_pretrained()` with this config: Hugging Face automatically compressed the weights to 4-bit during loading. The model was placed on GPU via `device_map='auto'`. We then tokenized an input, ran generation (weights were dequantized on-the-fly during forward passes), and decoded the output. The actual model weights in memory are 4-bit, but computation happened in bfloat16 precision.

Common gotcha

Developers often set `load_in_4bit=True` but forget to include `quantization_config=BitsAndBytesConfig(...)` with proper dtype settings. Without the config, quantization may use incompatible dtypes for computation, causing silent numerical errors or GPU memory spikes. Also: 4-bit models cannot be saved directly with `model.save_pretrained()`: only LoRA adapters trained on top can be saved. If you need to persist the quantized model, you must quantize again on next load.

Error recovery

RuntimeError: Expected scalar type Double but found Half
Mismatch between weight dtype and compute dtype. Ensure `bnb_4bit_compute_dtype` matches the dtype used in forward passes. Fix: set `bnb_4bit_compute_dtype=torch.bfloat16` or `torch.float16` depending on your GPU (bfloat16 for newer A100/H100, float16 for older V100).
AttributeError: type object 'BitsAndBytesConfig' has no attribute 'load_in_4bit'
You're using transformers < 4.30 or bitsandbytes not installed. Fix: run `pip install --upgrade transformers>=5.5.0 bitsandbytes` and confirm bitsandbytes version is >=0.41.0.
CUDA out of memory even with load_in_4bit=True
You're still running out of memory during generation (e.g., batch size too large or max_new_tokens too high). 4-bit reduces model weight memory, but generation buffers (KV cache, activations) still consume VRAM. Fix: reduce `batch_size=1` and lower `max_new_tokens` or use `device_map='cpu'` to offload non-critical layers.
ValueError: Attempting to deserialize object but no `_builtin_model_class` was stored in the config
You tried to load a 4-bit quantized model without the quantization config. 4-bit models cannot be loaded as standard checkpoints. Fix: always pass `quantization_config=BitsAndBytesConfig(load_in_4bit=True, ...)` when loading: do not try to save/load the 4-bit weights themselves.

Experienced dev note

A subtle gotcha: double quantization (`bnb_4bit_use_double_quant=True`) saves another ~0.4 bits per weight by quantizing the scale factors themselves, but it adds ~5-10% latency overhead. For inference latency-critical applications (sub-100ms SLA), disable it. For training LoRA adapters where you run fewer forward passes, enable it: the memory savings outweigh the latency cost. Also: the `bnb_4bit_quant_type='nf4'` (normalized float 4) is almost always better than 'fp4' because weight distributions are typically normal, not uniform: but fp4 is slightly faster if you're bandwidth-limited. Test both in your actual use case.

Check your understanding

Your model loads successfully in 4-bit but inference is slower than expected. You used `bnb_4bit_use_double_quant=True`. Explain why this overhead exists and why it doesn't apply equally to weight loading vs. inference.

Show answer hint

A correct answer recognizes that double quantization saves memory at load time (one-time cost) but adds per-inference overhead during dequantization (repetitive cost). The scale factors must be dequantized every forward pass, not just once. This is a memory-latency tradeoff that depends on whether you prioritize VRAM or throughput.

VERSION In transformers < 4.30.0, `BitsAndBytesConfig` did not exist; you had to manually configure bitsandbytes. In transformers 5.5.x (April 2026), `load_in_4bit` is the standard pattern. Also note: `bnb_4bit_quant_type='nf4'` became the default in transformers 4.35+, replacing older 'fp4' behavior.
NEXT

Once you're comfortable loading models in 4-bit, the next natural step is <strong>QLoRA fine-tuning</strong>: how to train lightweight adapters on top of frozen 4-bit base models without running out of memory.

Community Notes

No notes yetBe the first to share a version-specific fix or tip.