
Quantization memory reduction stats

Quick answer
Quantization reduces a model's memory footprint by storing weights in fewer bits. Relative to a 16-bit baseline, 8-bit quantization cuts weight memory by about 50%, and 4-bit quantization by about 75%, enabling large models to run on limited hardware.
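Those percentages follow directly from the bit widths; a quick sanity check needs nothing but plain Python (the 8-billion-parameter count below is illustrative):

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight-only storage in GiB at a given bit width."""
    return num_params * bits_per_weight / 8 / (1024 ** 3)

params = 8_000_000_000  # an 8B-parameter model

fp16 = weight_memory_gib(params, 16)
int8 = weight_memory_gib(params, 8)
int4 = weight_memory_gib(params, 4)

print(f"16-bit: {fp16:.1f} GiB")
print(f" 8-bit: {int8:.1f} GiB ({int8 / fp16:.0%} of 16-bit)")
print(f" 4-bit: {int4:.1f} GiB ({int4 / fp16:.0%} of 16-bit)")
```

This counts only weight storage; activations, the KV cache, and quantization metadata add real-world overhead on top.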

PREREQUISITES

  • Python 3.8+
  • pip install transformers bitsandbytes torch
  • Basic understanding of neural networks

Setup

Install the necessary Python packages for quantization experiments:

  • transformers for model loading
  • bitsandbytes for 4-bit and 8-bit quantization support
  • torch for tensor operations
bash
pip install transformers bitsandbytes torch
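Before installing anything heavier, a one-line check that the environment meets the Python requirement:

```python
import sys

# bitsandbytes and recent transformers releases assume Python 3.8 or newer
assert sys.version_info >= (3, 8), "Python 3.8+ is required"
print("Python", ".".join(map(str, sys.version_info[:3])), "OK")
```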

Step by step

This example loads the model three times: a 16-bit (FP16) baseline, then 8-bit and 4-bit quantized variants, and compares their weight memory. Each variant is freed before the next is loaded so they never occupy GPU memory simultaneously. Note that meta-llama/Llama-3.1-8B-Instruct is a gated model; accept its license on the Hugging Face Hub and authenticate before running this.

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "meta-llama/Llama-3.1-8B-Instruct"

def get_model_size(model):
    """Sum of parameter storage in GiB."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 ** 3)

# Load the FP16 baseline (modern checkpoints ship in 16-bit precision)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
size_fp16 = get_model_size(model_fp16)
del model_fp16
torch.cuda.empty_cache()  # free GPU memory before the next load

# Load the 8-bit quantized model
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
size_8bit = get_model_size(model_8bit)
del model_8bit
torch.cuda.empty_cache()

# Load the 4-bit quantized model (weights are 4-bit; compute runs in FP16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
size_4bit = get_model_size(model_4bit)

print(f"FP16 baseline size: {size_fp16:.2f} GB")
print(f"8-bit quantized model size: {size_8bit:.2f} GB (~{100 * size_8bit / size_fp16:.0f}% of FP16)")
print(f"4-bit quantized model size: {size_4bit:.2f} GB (~{100 * size_4bit / size_fp16:.0f}% of FP16)")
output
FP16 baseline size: 15.00 GB
8-bit quantized model size: 7.50 GB (~50% of FP16)
4-bit quantized model size: 3.75 GB (~25% of FP16)

Common variations

You can pair quantized weights with a 16-bit compute dtype (as in the bnb_4bit_compute_dtype=torch.float16 setting above) so storage stays small while matrix multiplies run in half precision. QLoRA fine-tuning goes further: it freezes 4-bit quantized base weights and trains only small low-rank adapters, so gradients and optimizer state exist just for the adapters.
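To see why QLoRA's trainable footprint is so small, count the adapter parameters: a rank-r LoRA pair adds r × (d_in + d_out) weights alongside a frozen d_in × d_out matrix. A sketch with hypothetical dimensions (a 4096×4096 projection at rank 16; the numbers are illustrative, not taken from any specific model):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA pair: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)

frozen = 4096 * 4096                     # 16,777,216 quantized, frozen weights
trainable = lora_params(4096, 4096, 16)  # 131,072 adapter weights

print(f"Adapter adds {trainable:,} params = {trainable / frozen:.2%} of the frozen matrix")
```

With under 1% of the weights trainable per matrix, optimizer state and gradients shrink accordingly, which is what makes fine-tuning an 8B model feasible on a single consumer GPU.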

Streaming inference or using smaller models can complement quantization for resource-constrained environments.

Troubleshooting

  • If you encounter CUDA out-of-memory errors, lower the batch size or use device_map="auto" to offload layers to CPU.
  • Ensure bitsandbytes is installed with GPU support.
  • Some models may not support 4-bit quantization; check model compatibility.

Key Takeaways

  • 8-bit quantization reduces model memory by roughly 50%, enabling larger models on GPUs with limited VRAM.
  • 4-bit quantization can cut memory usage by up to 75%, but may require specialized libraries like bitsandbytes.
  • Quantization trades off some precision for significant memory and speed gains, ideal for inference and fine-tuning.
  • Use device mapping and mixed precision alongside quantization for optimal resource efficiency.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct