Code Beginner easy · 5 min

Slow CPU inference: no GPU

What you will learn

CPU inference is dramatically slower than GPU inference, but it still works. Learn why it is slower and what you are trading off.

Why this matters

Most developers start without GPU access (laptops, free-tier servers). Understanding CPU bottlenecks helps you judge whether your inference speed is acceptable or whether you need a hardware upgrade, and it explains why the same model can run 50–100× slower on CPU, depending on the model and the hardware.

Skip if: Don't use this knowledge as an excuse to skip GPU entirely in production. If you need inference latency under 500 ms, CPU is rarely viable for anything but very small models. If you're running one inference per hour on a laptop for exploration, CPU is fine.

Explanation

What it is: Running transformer model inference on the CPU (main processor) instead of a GPU (graphics card). Transformers are linear algebra–heavy. A CPU has a handful of cores, each working on a few numbers at a time, while a GPU runs thousands of arithmetic operations in parallel.

How it works mechanically: When you load a model with device_map='cpu' or don't specify a device, PyTorch runs the matrix multiplications (the core of transformers) on your CPU cores. Each forward pass through a 7B-parameter model requires billions of arithmetic operations per token. A GPU with 5,000+ cores spreads that work across thousands of parallel lanes; a CPU with 8 cores can only split it 8 ways, plus a few vector lanes per core. The math is straightforward: same work, far fewer workers, much longer time.
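The "same work, fewer workers" arithmetic can be sketched with a back-of-the-envelope estimate. This is a minimal sketch, assuming the common approximation of about 2 floating-point operations per parameter per token. The function name is hypothetical, and both throughput figures are illustrative assumptions rather than measurements. Real inference is often limited by memory bandwidth as well, so treat the result as an order-of-magnitude guide only.

```python
def estimate_seconds_per_token(num_params: float, sustained_flops: float) -> float:
    """Estimate forward-pass time for one token.

    Assumes ~2 FLOPs per parameter per token (one multiply and one add
    per weight), which is an approximation, not an exact count.
    """
    flops_per_token = 2 * num_params
    return flops_per_token / sustained_flops


# Assumed sustained throughputs (illustrative placeholders, not benchmarks).
CPU_FLOPS = 100e9  # ~100 GFLOP/s, a plausible laptop CPU
GPU_FLOPS = 10e12  # ~10 TFLOP/s, a plausible data-center GPU

params_7b = 7e9
cpu_time = estimate_seconds_per_token(params_7b, CPU_FLOPS)
gpu_time = estimate_seconds_per_token(params_7b, GPU_FLOPS)

print(f"CPU: {cpu_time:.3f} s/token")
print(f"GPU: {gpu_time:.4f} s/token")
print(f"Ratio: {cpu_time / gpu_time:.0f}x")
```

Swap in throughput numbers for your own hardware to see how the ratio moves; the gap comes entirely from the assumed operations-per-second figures.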

When to use it: Development, testing, and tiny models (< 500M parameters) on machines without GPU. For production or latency-sensitive use, GPU or inference optimization (quantization, distillation) is required.

Analogy

Imagine washing 1,000 dishes. A GPU is 5,000 people each washing one dish in parallel (done in seconds). A CPU is a team of eight washing the same 1,000 dishes (each person is quick, but the job still takes far longer). The dishes don't change; the parallelism does.

Code

python

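import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small model for CPU demo
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map='cpu' places the weights on CPU at load time
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cpu')

prompt = 'The future of AI is'
inputs = tokenizer(prompt, return_tensors='pt').to('cpu')

# perf_counter is a monotonic clock suited to measuring short intervals
start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(
        **inputs,  # passes input_ids and attention_mask together
        max_length=20,  # total length in tokens, prompt included
        do_sample=False,  # greedy decoding, so the output is deterministic
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )
end = time.perf_counter()

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f'Generated: {generated_text}')
print(f'Time elapsed: {end - start:.2f} seconds')
print(f'Device used: {model.device}')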

Output

Generated: The future of AI is very bright. We have a lot of great things
Time elapsed: 3.45 seconds
Device used: cpu

What just happened?

The code loaded GPT-2 (124M parameters) explicitly onto CPU, generated text up to a total length of 20 tokens (prompt included) using model.generate(), and measured wall-clock time. On a typical 4-core laptop CPU, this took ~3–5 seconds. The same code on a modern GPU (NVIDIA A100) would take ~0.15 seconds, roughly 20–30× faster.
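A single timed run can be noisy, since the first call often pays one-off costs such as cache warm-up. A minimal sketch of a steadier measurement follows, assuming a hypothetical benchmark() helper and a stand-in workload (fake_generate) in place of the real model.generate() call.

```python
import statistics
import time


def benchmark(fn, warmup: int = 1, repeats: int = 5) -> dict:
    """Call fn() repeatedly and summarize wall-clock latency in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialization)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
        "runs": repeats,
    }


# Stand-in workload; in the lesson's script this would wrap model.generate(...).
def fake_generate():
    sum(i * i for i in range(50_000))


stats = benchmark(fake_generate, warmup=1, repeats=5)
print(
    f"median {stats['median']:.4f}s "
    f"(min {stats['min']:.4f}s, max {stats['max']:.4f}s)"
)
```

Reporting the median rather than one run makes CPU and GPU numbers easier to compare, because a single slow outlier no longer dominates.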

Common gotcha

Developers often assume their inference code is slow because it's poorly written. Usually it isn't; it's just running on CPU. Moving the exact same code to device='cuda' reveals that the real bottleneck was hardware, not the algorithm. Don't optimize prematurely on CPU; get to GPU first.

Error recovery

CUDA out of memory

You have a GPU but it's full. This is not a CPU problem; it means you moved the model to the GPU and ran out of memory. Reduce batch size, load the weights in lower precision (torch_dtype=torch.bfloat16), quantize to 8-bit or 4-bit, or add device_map='auto' to spread the model across the available GPUs (spilling to CPU if needed).

Model loads but inference hangs

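CPU inference on very large models (> 13B) can appear to hang or make the machine unresponsive. This is usually not an error; generation is just so slow that it looks stuck, especially if the weights (about 52 GB for 13B parameters in float32) exceed your RAM and the system starts swapping. Either reduce model size or add a GPU.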
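Before waiting on a large model, it can help to check whether its weights fit in memory at all. The sketch below is a rough estimate that multiplies parameter count by bytes per parameter. The helper name is hypothetical, and the figures cover weights only; activations and the generation cache need additional memory on top.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "int8": 1,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"13B params in {dtype}: ~{weight_memory_gb(13e9, dtype):.1f} GB")
```

If the estimate exceeds your machine's RAM, an apparent hang is more likely the operating system swapping than the model computing.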

Experienced dev note

A common mistake among senior engineers: they benchmark on GPU, declare the system 'production-ready,' then deploy to a CPU-only environment and are shocked by a 100× slowdown. Always benchmark on your target hardware. If your target is CPU-only (edge device, cost-constrained server), design around small or distilled models (like DistilBERT), quantized where possible, from the start, not by down-converting a 7B model later.

Check your understanding

You measure inference latency on your laptop CPU and get 8 seconds per request. Your API needs 500ms response time. What are two concrete approaches to fix this that don't require buying new hardware? (Hint: one is about changing the model, one is about changing how you run it.)

Show answer hint

A correct answer recognizes: (1) quantization or distillation shrinks the model so it runs faster on CPU, and (2) batching multiple requests together amortizes cost, which improves throughput but not the latency of a single request. Neither closes an 8 s → 500 ms gap (a 16× speedup) on CPU alone. The real answer is: you cannot hit 500 ms on CPU for this model; you need a GPU or a much smaller model. Understanding this gap is the key.
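The size of that gap is simple arithmetic, sketched below. The two speedup factors are hypothetical placeholders chosen only to illustrate the calculation; they are not measured results, and the helper names are assumptions as well.

```python
import math


def required_speedup(measured_s: float, budget_s: float) -> float:
    """How many times faster inference must get to fit the latency budget."""
    return measured_s / budget_s


def projected_latency(measured_s: float, speedups: dict) -> float:
    """Latency if every assumed speedup factor applied multiplicatively."""
    return measured_s / math.prod(speedups.values())


measured_s = 8.0  # laptop CPU latency from the question
budget_s = 0.5    # API target from the question

# Hypothetical speedup factors (placeholders only; measure your own).
assumed_speedups = {"int8 quantization": 2.0, "distilled model": 2.0}

needed = required_speedup(measured_s, budget_s)
projected = projected_latency(measured_s, assumed_speedups)
fits = projected <= budget_s

print(f"Needed: {needed:.0f}x faster")
print(f"Projected: {projected:.1f} s, fits budget: {fits}")
```

Under these assumed factors the projected latency still misses the budget, which is the point of the exercise: plug in your own measured speedups before committing to a CPU-only design.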

VERSION transformers 5.5.x changed how device placement works. In 4.x, the common pattern was model.to('cpu') post-load. In 5.5.x, always use device_map='cpu' at load time; it's clearer and integrates with quantization config.

Old pattern: model = AutoModelForCausalLM.from_pretrained(name).to('cpu')

New pattern: model = AutoModelForCausalLM.from_pretrained(name, device_map='cpu')

Now that you understand why CPU is slow, learn how to move your model to GPU with device_map='cuda' and see the 30–50× speedup in practice.
