
How to run Llama on CPU

Quick answer
To run Llama models on CPU, load them with the transformers library and device_map="cpu". To reduce memory use, pass a BitsAndBytesConfig with load_in_4bit=True. Note that bitsandbytes' quantized kernels were originally built for CUDA GPUs, so 4-bit inference on CPU requires a recent multi-backend release of the library; without it, load the model unquantized in bfloat16 or float32.

PREREQUISITES

  • Python 3.8+
  • pip install torch transformers accelerate bitsandbytes
  • Enough RAM: 16GB+ recommended for an 8B model (roughly 4-5GB of weights in 4-bit, ~16GB in bfloat16, ~32GB in float32)
  • Optional: pip install peft for LoRA fine-tuning

Setup

Install the necessary Python packages to run Llama models on CPU with quantization support.

bash
pip install torch transformers accelerate bitsandbytes
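Before loading anything large, it can help to confirm the environment is complete. A minimal check (standard library only; the package list simply mirrors the install command above) reports anything missing:

```python
import importlib.util

def missing(packages):
    """Return the subset of package names that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# The four packages installed above; bitsandbytes is only needed
# if you plan to quantize.
required = ["torch", "transformers", "accelerate", "bitsandbytes"]
absent = missing(required)
if absent:
    print("Missing packages:", ", ".join(absent))
else:
    print("All packages found.")
```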

Step by step

This example loads a Llama 3.1 8B Instruct model on CPU using 4-bit quantization, runs a simple prompt, and prints the generated text. Note that meta-llama checkpoints are gated: accept the license on the model's Hugging Face page and authenticate (e.g. huggingface-cli login) before the first download.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantization config. bitsandbytes needs a recent multi-backend
# release for CPU support; float16 compute is slow or unsupported on
# many CPUs, so bfloat16 (or float32) is the safer choice here.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="cpu"
)

prompt = "Write a short poem about AI and CPUs."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Write a short poem about AI and CPUs.

In circuits deep, where data flows,
AI wakes, and knowledge grows.
On CPUs, the thoughts arise,
Silent minds that analyze.
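Because this is an instruct-tuned checkpoint, production code should format the prompt with the model's chat template rather than passing raw text; tokenizer.apply_chat_template handles this for you. As a rough illustration only, the helper below hand-rolls what that template produces for a single user turn (the special tokens are an assumption based on the published Llama 3 chat format; always prefer the tokenizer's own method):

```python
def llama3_chat_prompt(user_message: str) -> str:
    """Hand-rolled sketch of the Llama 3 chat format for one user turn.
    In real code, use tokenizer.apply_chat_template(...) instead."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = llama3_chat_prompt("Write a short poem about AI and CPUs.")
```

With transformers, the equivalent is inputs = tokenizer.apply_chat_template([{"role": "user", "content": "..."}], add_generation_prompt=True, return_tensors="pt").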

Common variations

  • Use device_map="auto" if you have a GPU available.
  • For smaller models, omit quantization for full precision.
  • Use peft library to apply LoRA adapters on CPU.
  • Pin the thread count with torch.set_num_threads() to match your physical cores for better CPU utilization.
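CPU throughput also depends on how many threads PyTorch uses. A minimal sketch of choosing a count (the halving heuristic is an assumption: os.cpu_count() reports logical cores, which overcounts on hyper-threaded machines, so tune empirically):

```python
import os

# Logical core count; halving is a rough guess at physical cores
# on hyper-threaded machines.
logical = os.cpu_count() or 1
threads = max(1, logical // 2)

# With torch installed, apply this before calling generate():
# import torch
# torch.set_num_threads(threads)
print(f"Would pin PyTorch to {threads} thread(s)")
```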

Troubleshooting

  • If you run out of memory, switch to a smaller model or free RAM by closing other processes; note that 8-bit quantization uses more memory than 4-bit, not less.
  • Ensure bitsandbytes is installed correctly; its quantized kernels historically required a CUDA GPU, and CPU support only arrived in recent multi-backend releases, so check your installed version.
  • Prefer torch.bfloat16 or torch.float32 as the compute dtype on CPU; float16 arithmetic is slow or unsupported on most CPUs.
  • Check that your Python environment matches the installed package versions.
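To anticipate memory problems before downloading anything, a back-of-the-envelope estimate of weight size helps (weights only; activations and the KV cache add overhead, so treat these as lower bounds):

```python
def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

# An 8B-parameter model at common precisions.
for bits, label in [(32, "float32"), (16, "bfloat16"), (8, "int8"), (4, "4-bit")]:
    print(f"8B model @ {label}: ~{weight_gb(8e9, bits):.1f} GB")  # 32.0 / 16.0 / 8.0 / 4.0
```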

Key Takeaways

  • Use 4-bit quantization with BitsAndBytesConfig to run Llama efficiently on CPU.
  • Load the model with device_map="cpu" to force CPU inference.
  • Install bitsandbytes and transformers packages for quantized model support.
  • Adjust compute dtype based on your CPU capabilities for best performance.
  • Use peft for applying LoRA adapters on CPU if fine-tuning is needed.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct