low_cpu_mem_usage: loading without peak RAM
Why this matters
Loading large models (7B–70B parameters) on CPU-only or otherwise memory-constrained machines often fails with an out-of-memory (OOM) error before inference even starts. Low CPU memory loading defers weight allocation, so a model whose weights fit in RAM is not rejected merely because the loading step briefly needs about twice that amount. It is the standard pattern for resource-constrained deployment environments.
Explanation
The low_cpu_mem_usage=True parameter in from_pretrained() changes how HuggingFace Transformers materializes model weights from disk. By default, the library loads weights into a temporary dictionary, then copies them into the model's buffers, which roughly doubles peak memory. With low_cpu_mem_usage=True, weights are placed directly into the model structure without the temporary copy, reducing peak RAM by roughly 50%.
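The difference between the two loading strategies can be illustrated without any ML library. The sketch below is a simulation, not the Transformers loader: `read_tensor`, `load_with_temp_copy`, and `load_streaming` are hypothetical names, and plain `bytearray` buffers stand in for tensors. It uses the standard-library `tracemalloc` module to compare peak allocation.

```python
import tracemalloc

TENSOR_BYTES = 1_000_000   # size of each simulated tensor
NUM_TENSORS = 8            # number of simulated tensors in the checkpoint


def read_tensor(index):
    """Hypothetical stand-in for reading one tensor from disk."""
    return bytearray(TENSOR_BYTES)


def load_with_temp_copy():
    # Step 1: read the whole checkpoint into a temporary dictionary.
    state_dict = {i: read_tensor(i) for i in range(NUM_TENSORS)}
    # Step 2: copy every tensor into the "model" while the dictionary is still alive.
    return {i: bytearray(buf) for i, buf in state_dict.items()}


def load_streaming():
    # Read each tensor and place it directly in the "model"; no second copy exists.
    model = {}
    for i in range(NUM_TENSORS):
        model[i] = read_tensor(i)
    return model


def peak_bytes(loader):
    """Return the peak traced allocation (in bytes) while the loader runs."""
    tracemalloc.start()
    model = loader()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


peak_copy = peak_bytes(load_with_temp_copy)
peak_stream = peak_bytes(load_streaming)
print(f"temp-copy peak: {peak_copy / 1e6:.1f} MB")
print(f"streaming peak: {peak_stream / 1e6:.1f} MB")
```

Both loaders end with the same data in memory; only the peak differs, which is the quantity low_cpu_mem_usage targets.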
Mechanically, the loader uses torch.device('meta') to create a skeleton model whose tensors carry shape and dtype but have no storage allocated, then sequentially streams weight files and assigns them in place. The same idea is exposed by the init_empty_weights() context manager in the Accelerate library, which defers actual allocation. The trade-off is slightly slower initialization (milliseconds to seconds, depending on model size and storage speed) because weights are streamed sequentially rather than loaded in parallel across multiple files.
Use this whenever loading models on machines with constrained RAM: cloud instances, edge devices, or development laptops where you want to load a model alongside your application code. It's now the recommended default pattern in transformers 5.5.x.
Analogy
Think of assembling a large piece of furniture. Without `low_cpu_mem_usage`, you unbox all pieces into your living room (peak clutter), then move them into place. With it enabled, you place each piece directly into its final spot as you remove it from the box: same end result, half the mess at any moment.
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import psutil
import os
# Monitor memory before loading
process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / 1024 / 1024 # MB
print(f"Memory before loading: {mem_before:.1f} MB")
# Load GPT-2 Medium (~345M params) with low_cpu_mem_usage=True
model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    device_map="cpu",
)
mem_after = process.memory_info().rss / 1024 / 1024 # MB
mem_increase = mem_after - mem_before
print(f"Memory after loading: {mem_after:.1f} MB")
print(f"Memory increase: {mem_increase:.1f} MB")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
# Verify it works by running inference
input_ids = tokenizer.encode("The future of AI is", return_tensors="pt")
output = model.generate(input_ids, max_length=20, do_sample=False)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"\nGenerated text: {generated_text}")
Output:
Memory before loading: 285.4 MB
Memory after loading: 587.3 MB
Memory increase: 301.9 MB
Model parameters: 345.0M

Generated text: The future of AI is to be able to understand and respond to the
What just happened?
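The code instantiated a process memory monitor, loaded GPT-2 Medium with `low_cpu_mem_usage=True`, measured the RSS memory delta, then ran a short generation to confirm the model was functional. With this option, the increase should reflect the model's parameters without the temporary allocation overhead. Note that ~345M parameters in float32 occupy about 1.4 GB (4 bytes each), so the ~302 MB delta shown in the sample output is lower than the weights alone would imply. Treat the sample figures as illustrative and measure on your own machine.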
Common gotcha
Developers often assume that low_cpu_mem_usage=True will by itself move or split a model across GPU and CPU. It won't: low_cpu_mem_usage only saves RAM during loading. Where the weights end up is controlled by device_map. To actually place layers on a GPU, you need a GPU with sufficient VRAM and must pass `device_map='auto'` (which requires `accelerate` installed). Confusing the two is the #1 mistake: the model loads successfully but stays on the CPU, so you get no GPU acceleration.
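One way to catch this mistake early is to check where the parameters actually live after loading. The helper below is a hypothetical convenience function, not part of Transformers; it relies only on the standard `nn.Module.parameters()` API, so it should work on any PyTorch model, including one returned by from_pretrained(). A tiny stand-in model keeps the sketch runnable without downloading anything.

```python
import torch
from torch import nn


def parameter_devices(model):
    """Return the set of device names that hold the model's parameters."""
    return {str(param.device) for param in model.parameters()}


# Tiny stand-in model; in practice, pass the object returned by from_pretrained().
tiny_model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

devices = parameter_devices(tiny_model)
print(devices)

# If only CPU shows up, nothing was placed on a GPU, whatever the loading flags were.
if devices == {"cpu"}:
    print("All parameters are on the CPU.")
```

A result containing only the CPU after a load you expected to use a GPU points at the device_map setting, not at low_cpu_mem_usage.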
Error recovery
OutOfMemoryError even with low_cpu_mem_usage=True
RuntimeError: Expected all tensors to be on the same device
AttributeError: 'NoneType' object has no attribute 'shape'
NameError: name 'torch' is not defined
Experienced dev note
In transformers 4.x, low_cpu_mem_usage was opt-in and required manual weight manipulation. In 5.5.x, it's the default behavior, but you still need to pass it explicitly when using older model formats or quantized checkpoints. The real lesson is to always profile actual memory before and after loading. The RSS reported by psutil (for RAM) or the usage shown by nvidia-smi (for VRAM) is the ground truth, and memory usage varies widely by dtype (float32 vs bfloat16 vs int8) and, when training, by optimizer state that is easy to overlook. If you're loading models for production, always test the exact load + inference cycle on your target hardware with the real dtype settings.
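Before profiling, a back-of-the-envelope estimate of weight memory gives a baseline to compare measurements against. The sketch below covers weights only, assuming the usual per-element sizes for each dtype; `BYTES_PER_PARAM` and `weight_mebibytes` are illustrative names, and real usage adds activations, buffers, and framework overhead.

```python
# Bytes used by one parameter in each dtype (weights only).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_mebibytes(num_params, dtype):
    """Estimate the memory taken by the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**2


# Parameter counts: the 345M figure from the example, plus a 7B model for scale.
for num_params in (345_000_000, 7_000_000_000):
    for dtype in BYTES_PER_PARAM:
        size = weight_mebibytes(num_params, dtype)
        print(f"{num_params / 1e6:>7.0f}M params, {dtype:>8}: {size:>9,.0f} MiB")
```

If a measured RSS delta is far from the estimate for your dtype, check the dtype that was actually loaded and how the measurement was taken before trusting either number.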
Check your understanding
Why does low_cpu_mem_usage=True reduce peak RAM during loading, while device_map='auto' does not by itself reduce GPU memory usage? What does each parameter control?
Answer hint
A correct answer distinguishes between *how weights are allocated during loading* (low_cpu_mem_usage controls this) versus *where the final model lives* (device_map controls this). low_cpu_mem_usage saves RAM during the load operation itself; device_map determines the final target device but doesn't change the loading process. They solve different problems.
In 4.x, low_cpu_mem_usage required manual torch device handling and was slower. In 5.5.x, it uses init_empty_weights() by default for most model architectures, making it production-ready. If you're on 4.x, this pattern still works but is less efficient, so upgrade if possible.