Slow CPU inference: no GPU
Why this matters
Most developers start without GPU access (laptops, free-tier servers). Understanding CPU bottlenecks helps you judge whether your inference latency is acceptable or whether you need a hardware upgrade, and explains why the same model can run 50–100× slower on CPU.
Explanation
What it is: Running transformer model inference on the CPU (main processor) instead of a GPU (graphics card). Transformers are linear algebra–heavy; a CPU has a handful of cores that each work through a few operations at a time, while a GPU executes thousands in parallel.
How it works mechanically: When you load a model with device='cpu' or don't specify a device, PyTorch runs the matrix multiplications (the core of transformers) on your CPU cores. Each forward pass through a 7B parameter model requires billions of arithmetic operations. A GPU with 5,000+ cores spreads these across thousands of parallel workers; a CPU with 8 cores can split the work only about 8 ways (plus some vector instructions within each core). The math is straightforward: same work, far fewer workers, much longer time.
When to use it: Development, testing, and tiny models (< 500M parameters) on machines without a GPU. For production or latency-sensitive use, a GPU or inference optimization (quantization, distillation) is usually needed.
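The "same work, fewer workers" arithmetic above can be sketched as a back-of-envelope estimate. This is a rough sketch, not a benchmark: it assumes the common approximation of about 2 floating-point operations per parameter per generated token, it ignores memory bandwidth (which often dominates in practice), and the two throughput figures are hypothetical placeholders rather than measurements of any particular chip.

```python
# Back-of-envelope estimate of per-token latency from parameter count.
# Assumption: ~2 floating-point operations per parameter per generated token.
# The throughput constants below are hypothetical placeholders, not measurements.

def seconds_per_token(num_params: float, sustained_flops: float) -> float:
    """Estimated time to generate one token, ignoring memory bandwidth."""
    flops_per_token = 2 * num_params
    return flops_per_token / sustained_flops

PARAMS_7B = 7e9
CPU_FLOPS = 2e11  # hypothetical: 200 GFLOP/s sustained on a laptop CPU
GPU_FLOPS = 2e13  # hypothetical: 20 TFLOP/s sustained on a data-center GPU

cpu_time = seconds_per_token(PARAMS_7B, CPU_FLOPS)
gpu_time = seconds_per_token(PARAMS_7B, GPU_FLOPS)

print(f"CPU: {cpu_time:.3f} s/token")
print(f"GPU: {gpu_time:.5f} s/token")
print(f"Ratio: {cpu_time / gpu_time:.0f}x")
```

Plugging in throughput numbers measured on your own machine gives a quick sanity check on whether a given model size is even plausible for your latency target.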
Analogy
Imagine washing 1,000 dishes. A GPU is 5,000 people each washing one dish in parallel (done in seconds). A CPU is a crew of 8 splitting all 1,000 dishes between them (takes far longer). The dishes don't change; the parallelism does.
Code
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small model for CPU demo
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cpu')

prompt = 'The future of AI is'
inputs = tokenizer(prompt, return_tensors='pt').to('cpu')

# Time only the generation step, not model loading
start = time.time()
with torch.no_grad():
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=20,  # total length, prompt tokens included
        do_sample=False,  # greedy decoding, so the output is deterministic
        num_return_sequences=1
    )
end = time.time()

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f'Generated: {generated_text}')
print(f'Time elapsed: {end - start:.2f} seconds')
print('Device used: cpu')

Output:
Generated: The future of AI is very bright. We have a lot of great things
Time elapsed: 3.45 seconds
Device used: cpu
What just happened?
The code loaded GPT-2 (124M parameters) explicitly onto CPU, generated text from a prompt using model.generate() up to a total length of 20 tokens (max_length counts the prompt tokens too), and measured wall-clock time. On a typical 4-core laptop CPU, this took ~3–5 seconds. The same code on a modern GPU (NVIDIA A100) would take ~0.15 seconds, roughly 20–30× faster.
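A single timed call, as in the demo above, can be noisy because the first run often pays one-time setup costs. Below is a minimal sketch of a steadier measurement using only the standard library. The helper name benchmark and its parameters are hypothetical, and the workload is a stand-in so the snippet runs without a model.

```python
import statistics
import time

def benchmark(fn, warmup: int = 1, runs: int = 5) -> dict:
    """Time fn() over several runs after discarding warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up: first calls often pay one-time setup costs
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }

# Stand-in workload so the sketch runs anywhere; with the demo above you
# might pass something like: lambda: model.generate(**inputs, max_length=20)
result = benchmark(lambda: sum(i * i for i in range(100_000)))
print(result)
```

Reporting the median rather than a single run makes CPU-versus-GPU comparisons less sensitive to background load on the machine.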
Common gotcha
Developers often assume their inference code is slow because it's poorly written. Often it isn't; it's just running on CPU. Moving the exact same code to device='cuda' shows whether the real bottleneck was hardware rather than the algorithm. Don't optimize prematurely on CPU; get to GPU first.
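One way to make the same script run on either device is to choose the device at startup instead of hard-coding it. The sketch below assumes PyTorch; the helper name pick_device is hypothetical, and it falls back to 'cpu' when PyTorch is not installed so the snippet still runs.

```python
def pick_device() -> str:
    """Return 'cuda' when a CUDA GPU is usable, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report 'cpu'.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Running on: {device}")

# Usage with the demo above (not executed here):
#   model = model.to(device)
#   inputs = tokenizer(prompt, return_tensors='pt').to(device)
```

Logging the chosen device next to every latency number makes it obvious later which hardware a measurement came from.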
Error recovery
CUDA out of memory
Model loads but inference hangs
Experienced dev note
The single biggest mistake senior engineers make is benchmarking on GPU, declaring the system 'production-ready,' then deploying to a CPU-only environment and being shocked by a 100× slowdown. Always benchmark on your target hardware. If your target is CPU-only (edge device, cost-constrained server), design around small models, distilled (like DistilBERT) or quantized, from the start, rather than down-converting a 7B model later.
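To see why model size and precision matter on constrained targets, here is a rough sketch of weight memory at different precisions. It counts weights only (no activations or caches), uses decimal gigabytes, and treats bytes per parameter as the only variable; the function name weight_memory_gb is hypothetical.

```python
# Approximate storage for model weights at common precisions.
# Weights only: activations, caches, and runtime overhead are not included.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("GPT-2 (124M)", 124e6), ("7B model", 7e9)]:
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(params, dtype)
        print(f"{name:>13} {dtype}: {gb:6.2f} GB")
```

Under these assumptions a 7B model needs about 28 GB in fp32 and about 7 GB in int8, which is a quick way to check whether it even fits in the RAM of the target machine.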
Check your understanding
You measure inference latency on your laptop CPU and get 8 seconds per request. Your API needs 500ms response time. What are two concrete approaches to fix this that don't require buying new hardware? (Hint: one is about changing the model, one is about changing how you run it.)
Answer hint
A correct answer recognizes: (1) quantization or distillation shrinks the model so it runs faster on CPU, and (2) batching multiple requests together amortizes cost across requests, which improves throughput but not the latency of any single request. Neither is likely to close an 8 s → 500 ms gap (a 16× speedup) on CPU alone. The real answer is that you cannot hit 500 ms on CPU with this model; you need a GPU or a much smaller model. Understanding this gap is the key.
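The size of the gap in this exercise can be made concrete with a few lines of arithmetic. This is a sketch: the individual speedup factors in the example are hypothetical, and multiplying them assumes the optimizations are independent, which is optimistic.

```python
def required_speedup(measured_s: float, budget_s: float) -> float:
    """How many times faster inference must get to fit the budget."""
    return measured_s / budget_s

def meets_budget(measured_s: float, budget_s: float, speedups: list) -> bool:
    """Check whether combined speedups bring latency within the budget.

    Assumption: the factors multiply, i.e. the optimizations are independent.
    """
    combined = 1.0
    for factor in speedups:
        combined *= factor
    return measured_s / combined <= budget_s

print(required_speedup(8.0, 0.5))  # 8 s measured, 500 ms budget

# Hypothetical factors: 2x from quantization, 4x from a smaller model.
print(meets_budget(8.0, 0.5, [2.0, 4.0]))
```

With these hypothetical factors the combined speedup is 8×, which still leaves latency at about 1 second, twice the budget.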
model.to('cpu') post-load. In 5.5.x, always use device_map='cpu' at load time; it's clearer and integrates with quantization config.
Old pattern: model = AutoModelForCausalLM.from_pretrained(name).to('cpu')
New pattern: model = AutoModelForCausalLM.from_pretrained(name, device_map='cpu')