CUDA out of memory: model too large
Why this matters
Large language models like Llama or Mistral often exceed the memory of a single GPU. Knowing how to load them efficiently means you can run inference on a consumer GPU instead of paying for cloud GPUs.
Explanation
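The Problem: A CUDA out-of-memory (OOM) error occurs when the model you are loading needs more memory than your GPU's VRAM provides. A 7B-parameter model in float32 needs ~28GB for the weights alone (7 billion parameters × 4 bytes each); most consumer GPUs have 8-24GB.
How to Fix It: Three strategies reduce memory use: (1) device_map='auto' (which requires the accelerate package) fills the GPU first and places the layers that do not fit in CPU RAM, at the cost of slower inference for the offloaded layers; (2) torch_dtype=torch.bfloat16 halves weight memory relative to float32 by storing each parameter in 16 bits; (3) BitsAndBytesConfig quantizes weights to 8-bit or 4-bit, shrinking them by roughly 4x or 8x relative to float32.
When to Use: Start with device_map='auto' and bfloat16, which costs little accuracy for inference. If the model still does not fit, or CPU offloading makes it too slow, quantize to 8-bit and then to 4-bit. Reserve quantization for cases where half precision still does not fit, such as small consumer GPUs.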
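The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate for the weights only: the helper name `weight_memory_gb` and the dtype labels are illustrative (not part of any library), and the numbers ignore activations, the KV cache, and the small overhead that quantization adds for its scaling constants.

```python
# Approximate bytes needed to store one parameter in each format.
# "nf4" stands for 4-bit quantization (half a byte per parameter).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Weight footprint of a 7B-parameter model in each format.
for dtype in BYTES_PER_PARAM:
    print(f"7B in {dtype}: ~{weight_memory_gb(7e9, dtype):.1f} GB")
```

Comparing the result against your GPU's VRAM gives a quick first guess at which strategy is needed; leave headroom, since real usage is always higher than the weight estimate.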
Analogy
It's like fitting a large painting into a small frame. You can't change the frame size, but you can: (1) hang part of it on the wall outside the frame (device_map), (2) compress the image without much visible loss (bfloat16), or (3) paint the same thing but with fewer colors (quantization).
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
model_name = 'meta-llama/Llama-2-7b-hf'
print('Attempt 1: device_map="auto" with bfloat16')
try:
    # Llama 2 is a gated repository, so the tokenizer needs the token too
    tokenizer = AutoTokenizer.from_pretrained(model_name, token='hf_YOUR_TOKEN_HERE')
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map='auto',
        torch_dtype=torch.bfloat16,
        token='hf_YOUR_TOKEN_HERE'
    )
    print('✓ Model loaded successfully')
    print(f'Model dtype: {next(model.parameters()).dtype}')
    # Move the inputs to the device that holds the model's first layers
    inputs = tokenizer('Hello, my name is', return_tensors='pt').to(model.device)
    outputs = model.generate(**inputs, max_length=20, do_sample=False)
    print(f'Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}')
except RuntimeError as e:
    # torch.cuda.OutOfMemoryError is a subclass of RuntimeError
    if 'out of memory' in str(e):
        print(f'✗ Still out of memory: {e}')
        print('\nAttempt 2: 4-bit quantization')
        # NF4 weights with double quantization; matrix multiplies run in bfloat16
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map='auto',
            token='hf_YOUR_TOKEN_HERE'
        )
        print('✓ Model loaded with 4-bit quantization')
        inputs = tokenizer('Hello, my name is', return_tensors='pt').to(model.device)
        outputs = model.generate(**inputs, max_length=20, do_sample=False)
        print(f'Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}')
    else:
        # Not an OOM error: re-raise it unchanged
        raise
Output
Attempt 1: device_map="auto" with bfloat16
✓ Model loaded successfully
Model dtype: torch.bfloat16
Generated: Hello, my name is [your generated text]
(If OOM on your hardware:)
Attempt 2: 4-bit quantization
✓ Model loaded with 4-bit quantization
Generated: Hello, my name is [your generated text]
What just happened?
The code attempts to load a 7B model using device_map='auto' (which splits the model across GPU and CPU) and bfloat16 (half precision). If that fails with OOM, it falls back to 4-bit quantization using BitsAndBytes, which compresses the weights to roughly 3.5-4GB (about half a byte per parameter). Either path then generates text from a prompt.
Common gotcha
Developers often forget that device_map='auto' without torch_dtype still loads in float32 by default, wasting memory. Pair device_map='auto' with torch_dtype=torch.bfloat16 unless you specifically need full precision. Also, quantized models are usually somewhat slower at inference because the weights are dequantized on the fly: it's a memory-speed tradeoff, not a free win.
Error recovery
torch.cuda.OutOfMemoryError: CUDA out of memory
ValueError: 'bitsandbytes' is not installed
RuntimeError: expected scalar type Double but found Float
Experienced dev note
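In production, measure memory before and after loading with `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`, and use `torch.cuda.max_memory_allocated()` to capture the peak. device_map='auto' is safe for inference, but fine-tuning hits OOM sooner because the backward pass must keep activations, gradients, and optimizer state in memory; gradient checkpointing reduces the activation part by recomputing activations during the backward pass. Also: you cannot update quantized weights directly in transformers 5.5.x. Fine-tuning a quantized model means either training adapters such as LoRA on top of the frozen quantized base (the QLoRA approach, via the peft library) or dequantizing first.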
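A minimal sketch of that profiling step, assuming PyTorch is installed. The helper names `cuda_memory_gib` and `reset_peak` are illustrative, not library functions; only the `torch.cuda.*` calls are real PyTorch APIs. On a machine without a CUDA GPU the helper returns zeros instead of failing.

```python
import torch

GIB = 1024 ** 3


def cuda_memory_gib(device: int = 0) -> dict:
    """Return current and peak CUDA memory in GiB (zeros if no GPU is present)."""
    if not torch.cuda.is_available():
        return {"allocated": 0.0, "reserved": 0.0, "peak_allocated": 0.0}
    return {
        # Memory currently occupied by live tensors
        "allocated": torch.cuda.memory_allocated(device) / GIB,
        # Memory held by PyTorch's caching allocator (always >= allocated)
        "reserved": torch.cuda.memory_reserved(device) / GIB,
        # Highest value of "allocated" since the last reset
        "peak_allocated": torch.cuda.max_memory_allocated(device) / GIB,
    }


def reset_peak(device: int = 0) -> None:
    """Reset the peak counter so the next measurement covers only new work."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats(device)


reset_peak()
before = cuda_memory_gib()
# ... load the model and run generation here ...
after = cuda_memory_gib()
print(f"allocated delta: {after['allocated'] - before['allocated']:.2f} GiB")
print(f"peak allocated:  {after['peak_allocated']:.2f} GiB")
```

The peak figure is the one to compare against your VRAM budget, since a load or a generation step can briefly need more memory than what stays allocated afterwards.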
Check your understanding
You have a 13B model that runs out of memory on your 24GB GPU with float32. You add bfloat16 and it still OOMs. Why might 4-bit quantization work, and what tradeoff are you accepting?
Show answer hint
A correct answer mentions that bfloat16 cuts weight memory by ~2x, from ~52GB in float32 to ~26GB, which still exceeds 24GB before activations are counted, while 4-bit cuts it by ~8x to ~6-7GB. The tradeoff is slower inference and a slight accuracy loss from quantization; for open-ended text generation the quality drop is usually small.