How to run Llama on CPU
Quick answer
To run Llama models on CPU, use the transformers library with a BitsAndBytesConfig for 4-bit quantization to reduce memory usage. Load the model with device_map="cpu" and pass the quantization config (with load_in_4bit=True) to enable memory-efficient CPU inference.
Prerequisites
- Python 3.8+
- pip install torch transformers accelerate bitsandbytes
- Enough CPU RAM (16GB+ recommended)
- Optional: pip install peft for LoRA fine-tuning
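To see why 16GB+ of RAM is recommended, here is a back-of-the-envelope estimate for the model weights alone (a sketch; real usage is higher because of activations, the KV cache, and framework overhead):

```python
# Rough RAM estimate for model weights only (illustrative arithmetic;
# actual memory use adds activations, KV cache, and framework overhead).
def weight_ram_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Gigabytes needed to hold the weights at a given quantization width."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(round(weight_ram_gb(8, 16), 1))  # float16 weights for an 8B model -> 14.9
print(round(weight_ram_gb(8, 4), 1))   # the same model at 4-bit -> 3.7
```

This is why 4-bit quantization makes an 8B model practical on a 16GB machine while full float16 weights alone would nearly fill it.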
Setup
Install the necessary Python packages to run Llama models on CPU with quantization support.
pip install torch transformers accelerate bitsandbytes
Step by step
This example loads a Llama 3.1 8B instruct model on CPU using 4-bit quantization for efficient inference. It runs a simple prompt and prints the generated text.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="cpu"
)
prompt = "Write a short poem about AI and CPUs."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output
Write a short poem about AI and CPUs. In circuits deep, where data flows, AI wakes, and knowledge grows. On CPUs, the thoughts arise, Silent minds that analyze.
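The generate call above relies mostly on default decoding. A common pattern is to collect the decoding options in one dict and splat it into the call as model.generate(**inputs, **gen_kwargs); the values below are illustrative, not tuned:

```python
# Illustrative decoding settings (hypothetical values) to pass as
# model.generate(**inputs, **gen_kwargs); tune for your workload.
gen_kwargs = dict(
    max_new_tokens=50,  # cap output length to bound CPU time per request
    do_sample=True,     # sample tokens instead of greedy decoding
    temperature=0.7,    # lower values give more deterministic output
    top_p=0.9,          # nucleus sampling cutoff
)
```

Keeping the settings in one place makes it easy to experiment with greedy versus sampled decoding without touching the loading code.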
Common variations
- Use device_map="auto" if you have a GPU available.
- For smaller models, omit quantization and load in full precision.
- Use the peft library to apply LoRA adapters on CPU.
- Run inference asynchronously with accelerate for better CPU utilization.
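The variations above can be condensed into a small decision helper. This is a hypothetical sketch: the returned dict keys mirror BitsAndBytesConfig keyword arguments and can be splatted into it, or left empty to skip quantization entirely.

```python
# Hypothetical helper: pick a quantization setting from available RAM (GB).
# The returned dict is meant to be splatted into BitsAndBytesConfig(**...);
# an empty dict means "no quantization, load in full precision".
def pick_quantization(ram_gb: float) -> dict:
    if ram_gb >= 32:
        return {}                      # plenty of RAM: full precision
    if ram_gb >= 16:
        return {"load_in_8bit": True}  # 8-bit middle ground
    return {"load_in_4bit": True}      # tightest memory budget

print(pick_quantization(16))  # -> {'load_in_8bit': True}
```

The thresholds are rough rules of thumb, not measured limits; adjust them for your actual model size.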
Troubleshooting
- If you get
OutOfMemoryError, reduce batch size or use 8-bit quantization instead. - Ensure
bitsandbytesis installed correctly; it requires a compatible CPU architecture. - Use
torch.float16compute dtype only if your CPU supports it; otherwise usetorch.float32. - Check that your Python environment matches the installed package versions.
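The last check can be automated with a small sketch that reports which of the required packages (names taken from the setup step) are importable and which versions are installed:

```python
# Quick environment check: confirm each required package imports, and
# report its version (prints NOT INSTALLED instead of crashing if missing).
import importlib

def check_packages(names):
    report = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = "NOT INSTALLED"
    return report

for pkg, ver in check_packages(
    ["torch", "transformers", "accelerate", "bitsandbytes"]
).items():
    print(f"{pkg}: {ver}")
```

Run this before loading the model; a NOT INSTALLED line points directly at the pip install you still need.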
Key Takeaways
- Use 4-bit quantization with BitsAndBytesConfig to run Llama efficiently on CPU.
- Load the model with device_map="cpu" to force CPU inference.
- Install bitsandbytes and transformers packages for quantized model support.
- Adjust compute dtype based on your CPU capabilities for best performance.
- Use peft for applying LoRA adapters on CPU if fine-tuning is needed.