How to · Intermediate · 4 min read

Quantization for mobile devices

Quick answer
Quantization reduces the precision of model weights and activations from floating point to lower-bit integers (e.g., 8-bit) to shrink model size and speed up inference on mobile devices. Use PyTorch with BitsAndBytesConfig to prototype quantized transformer models, or TensorFlow Lite (TFLite) to produce integer-quantized models optimized for mobile hardware.

PREREQUISITES

  • Python 3.8+
  • pip install torch torchvision torchaudio
  • pip install transformers bitsandbytes
  • Basic knowledge of PyTorch or TensorFlow

Setup

Install necessary Python packages for quantization and mobile deployment. torch and transformers support quantization workflows. bitsandbytes enables 4-bit and 8-bit quantization for PyTorch models. For TensorFlow, use tensorflow and tflite-runtime.

bash
pip install torch torchvision torchaudio transformers bitsandbytes
pip install tensorflow  # optional, for the TFLite route; tflite-runtime goes on the target device

Step by step

Example: Quantize a Hugging Face transformer model to 8-bit for mobile inference using bitsandbytes and PyTorch.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Configure 8-bit quantization (bitsandbytes requires a CUDA-capable GPU)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load model with 8-bit quantized weights
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare input
prompt = "Explain quantization for mobile devices."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Explain quantization for mobile devices. Quantization reduces the precision of model weights and activations, allowing models to run efficiently on limited hardware by using less memory and compute.
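To see what 8-bit quantization does numerically, here is a minimal pure-Python sketch of absmax INT8 quantization, the per-tensor scaling idea behind common INT8 schemes. This is an illustration of the arithmetic, not the bitsandbytes internals:

```python
def quantize_int8(weights):
    """Absmax quantization: scale floats into the signed 8-bit range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Map 8-bit integers back to approximate float values."""
    return [qi * scale for qi in q]

weights = [0.41, -1.20, 0.07, 2.54, -0.33]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)

print(q)  # integers in [-127, 127], one byte each instead of four
print(max(abs(w - r) for w, r in zip(weights, recovered)))  # small rounding error
```

Each weight now needs one byte instead of four, at the cost of a rounding error bounded by roughly half the scale; that trade-off is the whole point of quantization.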

Common variations

Other quantization approaches include:

  • Post-training quantization: Convert a trained model to 8-bit or 4-bit without retraining, suitable for quick deployment.
  • Quantization-aware training (QAT): Train the model with quantization simulated to maintain accuracy.
  • TFLite quantization: TensorFlow Lite supports full integer quantization optimized for mobile CPUs and NPUs.

Method | Description | Use case
Post-training quantization | Convert FP32 model to INT8/4-bit after training | Fast deployment with some accuracy loss
Quantization-aware training | Train model simulating quantization effects | Best accuracy on quantized models
TFLite quantization | TensorFlow Lite full integer quantization | Mobile and embedded devices with TensorFlow models
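Quantization-aware training works by inserting "fake quantization" into the forward pass: values are snapped to the integer grid during training so the model learns to tolerate the rounding. A framework-free sketch of that fake-quant step (an illustration of the idea, not a full QAT loop):

```python
def fake_quantize(x, scale, num_bits=8):
    """Simulate integer rounding while staying in floating point.

    During QAT this runs in the forward pass; the backward pass typically
    uses a straight-through estimator that ignores the rounding.
    """
    qmax = 2 ** (num_bits - 1) - 1   # 127 for 8-bit
    qmin = -qmax                     # -127 for 8-bit
    q = round(x / scale)
    q = max(qmin, min(qmax, q))      # clamp to the representable grid
    return q * scale                 # back to float for the rest of the network

# A weight passes through the same grid it will see at inference time
print(fake_quantize(0.1234, scale=0.01))  # snaps to a multiple of 0.01
```

Because the rounding is applied during training, the loss reflects quantized behavior and the optimizer steers the weights toward values that survive conversion.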

Troubleshooting

If quantized model accuracy drops significantly, try quantization-aware training instead of post-training quantization. Ensure your target device supports the chosen quantization format (e.g., INT8 arithmetic on its CPU or NPU). For the PyTorch workflow above, note that bitsandbytes requires a CUDA-capable GPU; verify it is installed correctly and that your CUDA drivers are compatible.
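A quick way to catch the missing-dependency case before loading a model is to probe for the required packages at startup. A simple defensive sketch (the default package names are the ones used in this guide):

```python
import importlib.util

def check_quantization_deps(packages=("torch", "transformers", "bitsandbytes")):
    """Return the subset of required packages that are not importable."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = check_quantization_deps()
if missing:
    print("Missing packages: " + ", ".join(missing))
    print("Install them with: pip install " + " ".join(missing))
```

This fails fast with an actionable message instead of an ImportError buried inside model loading.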

Key Takeaways

  • Quantization reduces model size and speeds up inference by lowering numeric precision.
  • Use 8-bit or 4-bit quantization with frameworks like PyTorch + bitsandbytes or TensorFlow Lite for mobile devices.
  • Quantization-aware training preserves accuracy better than post-training quantization.
  • Check hardware compatibility and software dependencies to avoid runtime errors.
  • TFLite is the preferred tool for TensorFlow models targeting mobile and embedded platforms.

Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct