How to run Llama on CPU
Quick answer
To run Llama models on CPU, use the transformers library with a BitsAndBytesConfig for 4-bit quantization to reduce memory usage. Load the model with device_map="cpu" and pass the quantization config (with load_in_4bit=True) to enable memory-efficient CPU inference.
Prerequisites
- Python 3.8+
- pip install torch transformers accelerate bitsandbytes
- Enough CPU RAM (16GB+ recommended)
- Optional: pip install peft for LoRA fine-tuning
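To see why 16GB+ of RAM is recommended, here is a back-of-the-envelope estimate for the model weights alone (a sketch; real usage is higher because of activations, the KV cache, and framework overhead):

```python
# Rough RAM estimate for model weights only (illustrative arithmetic;
# actual memory use adds activations, KV cache, and framework overhead).
def weight_ram_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Gigabytes needed to hold the weights at a given quantization width."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(round(weight_ram_gb(8, 16), 1))  # float16 weights for an 8B model -> 14.9
print(round(weight_ram_gb(8, 4), 1))   # the same model at 4-bit -> 3.7
```

This is why 4-bit quantization makes an 8B model practical on a 16GB machine while full float16 weights alone would nearly fill it.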
Setup
Install the necessary Python packages to run Llama models on CPU with quantization support.
pip install torch transformers accelerate bitsandbytes
Step by step
This example loads a Llama 3.1 8B instruct model on CPU using 4-bit quantization for efficient inference. It runs a simple prompt and prints the generated text.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="cpu"
)
prompt = "Write a short poem about AI and CPUs."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output
Write a short poem about AI and CPUs. In circuits deep, where data flows, AI wakes, and knowledge grows. On CPUs, the thoughts arise, Silent minds that analyze.
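The generate call above relies mostly on default decoding. A common pattern is to collect the decoding options in one dict and splat it into the call as model.generate(**inputs, **gen_kwargs); the values below are illustrative, not tuned:

```python
# Illustrative decoding settings (hypothetical values) to pass as
# model.generate(**inputs, **gen_kwargs); tune for your workload.
gen_kwargs = dict(
    max_new_tokens=50,  # cap output length to bound CPU time per request
    do_sample=True,     # sample tokens instead of greedy decoding
    temperature=0.7,    # lower values give more deterministic output
    top_p=0.9,          # nucleus sampling cutoff
)
```

Keeping the settings in one place makes it easy to experiment with greedy versus sampled decoding without touching the loading code.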
Common variations
- Use device_map="auto" if you have a GPU available.
- For smaller models, omit quantization and load in full precision.
- Use the peft library to apply LoRA adapters on CPU.
- Run inference asynchronously with accelerate for better CPU utilization.
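The variations above can be condensed into a small decision helper. This is a hypothetical sketch: the returned dict keys mirror BitsAndBytesConfig keyword arguments and can be splatted into it, or left empty to skip quantization entirely.

```python
# Hypothetical helper: pick a quantization setting from available RAM (GB).
# The returned dict is meant to be splatted into BitsAndBytesConfig(**...);
# an empty dict means "no quantization, load in full precision".
def pick_quantization(ram_gb: float) -> dict:
    if ram_gb >= 32:
        return {}                      # plenty of RAM: full precision
    if ram_gb >= 16:
        return {"load_in_8bit": True}  # 8-bit middle ground
    return {"load_in_4bit": True}      # tightest memory budget

print(pick_quantization(16))  # -> {'load_in_8bit': True}
```

The thresholds are rough rules of thumb, not measured limits; adjust them for your actual model size.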
Troubleshooting
- If you get
OutOfMemoryError, reduce batch size or use 8-bit quantization instead. - Ensure
bitsandbytesis installed correctly; it requires a compatible CPU architecture. - Use
torch.float16compute dtype only if your CPU supports it; otherwise usetorch.float32. - Check that your Python environment matches the installed package versions.
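The last check can be automated with a small sketch that reports which of the required packages (names taken from the setup step) are importable and which versions are installed:

```python
# Quick environment check: confirm each required package imports, and
# report its version (prints NOT INSTALLED instead of crashing if missing).
import importlib

def check_packages(names):
    report = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = "NOT INSTALLED"
    return report

for pkg, ver in check_packages(
    ["torch", "transformers", "accelerate", "bitsandbytes"]
).items():
    print(f"{pkg}: {ver}")
```

Run this before loading the model; a NOT INSTALLED line points directly at the pip install you still need.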
Key Takeaways
- Use 4-bit quantization with BitsAndBytesConfig to run Llama efficiently on CPU.
- Load the model with device_map="cpu" to force CPU inference.
- Install bitsandbytes and transformers packages for quantized model support.
- Adjust compute dtype based on your CPU capabilities for best performance.
- Use peft for applying LoRA adapters on CPU if fine-tuning is needed.