
Quantization memory reduction stats

Quick answer
Quantization reduces a model's memory footprint by storing weights in fewer bits. Relative to a 16-bit baseline, 8-bit quantization cuts weight memory by about 50%, and 4-bit quantization by about 75%, enabling large models to run on limited hardware.
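Those percentages follow directly from the bit widths; a quick sanity check needs nothing but plain Python (the 8-billion-parameter count below is illustrative):

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight-only storage in GiB at a given bit width."""
    return num_params * bits_per_weight / 8 / (1024 ** 3)

params = 8_000_000_000  # an 8B-parameter model

fp16 = weight_memory_gib(params, 16)
int8 = weight_memory_gib(params, 8)
int4 = weight_memory_gib(params, 4)

print(f"16-bit: {fp16:.1f} GiB")
print(f" 8-bit: {int8:.1f} GiB ({int8 / fp16:.0%} of 16-bit)")
print(f" 4-bit: {int4:.1f} GiB ({int4 / fp16:.0%} of 16-bit)")
```

This counts only weight storage; activations, the KV cache, and quantization metadata add real-world overhead on top.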

PREREQUISITES

  • Python 3.8+
  • pip install transformers bitsandbytes torch
  • Basic understanding of neural networks

Setup

Install the necessary Python packages for quantization experiments:

  • transformers for model loading
  • bitsandbytes for 4-bit and 8-bit quantization support
  • torch for tensor operations
bash
pip install transformers bitsandbytes torch
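Before installing anything heavier, a one-line check that the environment meets the Python requirement:

```python
import sys

# bitsandbytes and recent transformers releases assume Python 3.8 or newer
assert sys.version_info >= (3, 8), "Python 3.8+ is required"
print("Python", ".".join(map(str, sys.version_info[:3])), "OK")
```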

Step by step

This example loads the model three times: a 16-bit (FP16) baseline, then 8-bit and 4-bit quantized variants, and compares their weight memory. Each variant is freed before the next is loaded so they never occupy GPU memory simultaneously. Note that meta-llama/Llama-3.1-8B-Instruct is a gated model; accept its license on the Hugging Face Hub and authenticate before running this.

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "meta-llama/Llama-3.1-8B-Instruct"

def get_model_size(model):
    """Sum of parameter storage in GiB."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 ** 3)

# Load the FP16 baseline (modern checkpoints ship in 16-bit precision)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
size_fp16 = get_model_size(model_fp16)
del model_fp16
torch.cuda.empty_cache()  # free GPU memory before the next load

# Load the 8-bit quantized model
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
size_8bit = get_model_size(model_8bit)
del model_8bit
torch.cuda.empty_cache()

# Load the 4-bit quantized model (weights are 4-bit; compute runs in FP16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
size_4bit = get_model_size(model_4bit)

print(f"FP16 baseline size: {size_fp16:.2f} GB")
print(f"8-bit quantized model size: {size_8bit:.2f} GB (~{100 * size_8bit / size_fp16:.0f}% of FP16)")
print(f"4-bit quantized model size: {size_4bit:.2f} GB (~{100 * size_4bit / size_fp16:.0f}% of FP16)")
output
FP16 baseline size: 15.00 GB
8-bit quantized model size: 7.50 GB (~50% of FP16)
4-bit quantized model size: 3.75 GB (~25% of FP16)

Common variations

You can pair quantized weights with a 16-bit compute dtype (as in the bnb_4bit_compute_dtype=torch.float16 setting above) so storage stays small while matrix multiplies run in half precision. QLoRA fine-tuning goes further: it freezes 4-bit quantized base weights and trains only small low-rank adapters, so gradients and optimizer state exist just for the adapters.
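To see why QLoRA's trainable footprint is so small, count the adapter parameters: a rank-r LoRA pair adds r × (d_in + d_out) weights alongside a frozen d_in × d_out matrix. A sketch with hypothetical dimensions (a 4096×4096 projection at rank 16; the numbers are illustrative, not taken from any specific model):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA pair: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)

frozen = 4096 * 4096                     # 16,777,216 quantized, frozen weights
trainable = lora_params(4096, 4096, 16)  # 131,072 adapter weights

print(f"Adapter adds {trainable:,} params = {trainable / frozen:.2%} of the frozen matrix")
```

With under 1% of the weights trainable per matrix, optimizer state and gradients shrink accordingly, which is what makes fine-tuning an 8B model feasible on a single consumer GPU.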

Streaming inference or using smaller models can complement quantization for resource-constrained environments.

Troubleshooting

  • If you encounter CUDA out-of-memory errors, lower the batch size or use device_map="auto" to offload layers to CPU.
  • Ensure bitsandbytes is installed with GPU support.
  • Some models may not support 4-bit quantization; check model compatibility.

Key Takeaways

  • 8-bit quantization reduces model memory by roughly 50%, enabling larger models on GPUs with limited VRAM.
  • 4-bit quantization can cut memory usage by up to 75%, but may require specialized libraries like bitsandbytes.
  • Quantization trades off some precision for significant memory and speed gains, ideal for inference and fine-tuning.
  • Use device mapping and mixed precision alongside quantization for optimal resource efficiency.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct