How to · Intermediate · 4 min read

Quantization for mobile devices

Quick answer
Quantization reduces the precision of model weights and activations from floating point to lower-bit integers (e.g., 8-bit) to shrink model size and speed up inference on mobile devices. Use PyTorch with BitsAndBytesConfig to prototype quantized transformer models, or TensorFlow Lite (TFLite) to produce integer-quantized models optimized for mobile hardware.

PREREQUISITES

  • Python 3.8+
  • pip install torch torchvision torchaudio
  • pip install transformers bitsandbytes
  • Basic knowledge of PyTorch or TensorFlow

Setup

Install necessary Python packages for quantization and mobile deployment. torch and transformers support quantization workflows. bitsandbytes enables 4-bit and 8-bit quantization for PyTorch models. For TensorFlow, use tensorflow and tflite-runtime.

bash
pip install torch torchvision torchaudio transformers bitsandbytes
pip install tensorflow  # optional, for the TFLite route; tflite-runtime goes on the target device

Step by step

Example: Quantize a Hugging Face transformer model to 8-bit for mobile inference using bitsandbytes and PyTorch.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Configure 8-bit quantization (bitsandbytes requires a CUDA-capable GPU)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load model with 8-bit quantized weights
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare input
prompt = "Explain quantization for mobile devices."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Explain quantization for mobile devices. Quantization reduces the precision of model weights and activations, allowing models to run efficiently on limited hardware by using less memory and compute.
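To see what 8-bit quantization does numerically, here is a minimal pure-Python sketch of absmax INT8 quantization, the per-tensor scaling idea behind common INT8 schemes. This is an illustration of the arithmetic, not the bitsandbytes internals:

```python
def quantize_int8(weights):
    """Absmax quantization: scale floats into the signed 8-bit range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Map 8-bit integers back to approximate float values."""
    return [qi * scale for qi in q]

weights = [0.41, -1.20, 0.07, 2.54, -0.33]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)

print(q)  # integers in [-127, 127], one byte each instead of four
print(max(abs(w - r) for w, r in zip(weights, recovered)))  # small rounding error
```

Each weight now needs one byte instead of four, at the cost of a rounding error bounded by roughly half the scale; that trade-off is the whole point of quantization.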

Common variations

Other quantization approaches include:

  • Post-training quantization: Convert a trained model to 8-bit or 4-bit without retraining, suitable for quick deployment.
  • Quantization-aware training (QAT): Train the model with quantization simulated to maintain accuracy.
  • TFLite quantization: TensorFlow Lite supports full integer quantization optimized for mobile CPUs and NPUs.

Method | Description | Use case
Post-training quantization | Convert FP32 model to INT8/4-bit after training | Fast deployment with some accuracy loss
Quantization-aware training | Train model simulating quantization effects | Best accuracy on quantized models
TFLite quantization | TensorFlow Lite full integer quantization | Mobile and embedded devices with TensorFlow models
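Quantization-aware training works by inserting "fake quantization" into the forward pass: values are snapped to the integer grid during training so the model learns to tolerate the rounding. A framework-free sketch of that fake-quant step (an illustration of the idea, not a full QAT loop):

```python
def fake_quantize(x, scale, num_bits=8):
    """Simulate integer rounding while staying in floating point.

    During QAT this runs in the forward pass; the backward pass typically
    uses a straight-through estimator that ignores the rounding.
    """
    qmax = 2 ** (num_bits - 1) - 1   # 127 for 8-bit
    qmin = -qmax                     # -127 for 8-bit
    q = round(x / scale)
    q = max(qmin, min(qmax, q))      # clamp to the representable grid
    return q * scale                 # back to float for the rest of the network

# A weight passes through the same grid it will see at inference time
print(fake_quantize(0.1234, scale=0.01))  # snaps to a multiple of 0.01
```

Because the rounding is applied during training, the loss reflects quantized behavior and the optimizer steers the weights toward values that survive conversion.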

Troubleshooting

If quantized model accuracy drops significantly, try quantization-aware training instead of post-training quantization. Ensure your target device supports the chosen quantization format (e.g., INT8 arithmetic on its CPU or NPU). For the PyTorch workflow above, note that bitsandbytes requires a CUDA-capable GPU; verify it is installed correctly and that your CUDA drivers are compatible.
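A quick way to catch the missing-dependency case before loading a model is to probe for the required packages at startup. A simple defensive sketch (the default package names are the ones used in this guide):

```python
import importlib.util

def check_quantization_deps(packages=("torch", "transformers", "bitsandbytes")):
    """Return the subset of required packages that are not importable."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = check_quantization_deps()
if missing:
    print("Missing packages: " + ", ".join(missing))
    print("Install them with: pip install " + " ".join(missing))
```

This fails fast with an actionable message instead of an ImportError buried inside model loading.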

Key Takeaways

  • Quantization reduces model size and speeds up inference by lowering numeric precision.
  • Use 8-bit or 4-bit quantization with frameworks like PyTorch + bitsandbytes or TensorFlow Lite for mobile devices.
  • Quantization-aware training preserves accuracy better than post-training quantization.
  • Check hardware compatibility and software dependencies to avoid runtime errors.
  • TFLite is the preferred tool for TensorFlow models targeting mobile and embedded platforms.

Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct