Code Beginner easy · 5 min

CUDA out of memory: model too large

What you will learn
When a model won't fit in GPU memory, use device_map='auto' and quantization to reduce the memory footprint.

Why this matters

Large language models like Llama or Mistral often exceed the memory of a single GPU. Knowing how to load them efficiently means you can run inference on consumer GPUs instead of paying for cloud GPUs.

Skip if: You only run inference on small models (< 1B parameters), or you have access to multiple high-memory GPUs and an unlimited budget.

Explanation

The Problem: A CUDA out-of-memory (OOM) error occurs when the model you load needs more memory than your GPU's VRAM. A 7B-parameter model in float32 needs ~28GB for its weights alone (7 billion × 4 bytes); most consumer GPUs have 8-24GB.

How to Fix It: Three strategies reduce GPU memory use:
(1) device_map='auto' places as many layers as fit on the GPU and offloads the rest to CPU memory;
(2) torch_dtype=torch.bfloat16 halves weight memory relative to float32 by storing 16-bit values;
(3) BitsAndBytesConfig quantizes weights to 8-bit or 4-bit, shrinking them by roughly 4x or 8x relative to float32.

When to Use: Start with device_map='auto' and bfloat16, as the code in this lesson does. Quantize only if the half-precision weights still do not fit on your GPU, because quantization costs some speed and accuracy.
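The numbers above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough estimate for weights only; it ignores activations, the KV cache, and quantization overhead, and the names `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative, not part of any library.

```python
# Approximate storage cost of one parameter at each precision (assumed values).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Estimate memory for model weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Print the estimate for a 7B-parameter model at each precision.
    for precision in BYTES_PER_PARAM:
        estimate = weight_memory_gb(7e9, precision)
        print(f"7B in {precision}: ~{estimate:.1f} GB")
```

Treat the result as a lower bound: real usage is higher once generation starts, so leave headroom when comparing it against your GPU's VRAM.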

Analogy

It's like fitting a large painting into a small frame. You can't change the frame size, but you can: (1) hang part of it on the wall outside the frame (device_map), (2) compress the image without much visible loss (bfloat16), or (3) paint the same thing but with fewer colors (quantization).

Code

python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig

model_name = 'meta-llama/Llama-2-7b-hf'

print('Attempt 1: device_map="auto" with bfloat16')
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map='auto',
        torch_dtype=torch.bfloat16,
        token='hf_YOUR_TOKEN_HERE'
    )
    print(f'✓ Model loaded successfully')
    print(f'Model dtype: {next(model.parameters()).dtype}')
    
    # Move the token IDs to the device that holds the model's first layers.
    inputs = tokenizer('Hello, my name is', return_tensors='pt').to(model.device)
    outputs = model.generate(**inputs, max_length=20, do_sample=False)
    print(f'Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}')
except RuntimeError as e:
    if 'out of memory' in str(e):
        print(f'✗ Still out of memory: {e}')
        print('\nAttempt 2: 4-bit quantization')
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map='auto',
            token='hf_YOUR_TOKEN_HERE'
        )
        print(f'✓ Model loaded with 4-bit quantization')
        # Move the token IDs to the device that holds the model's first layers.
        inputs = tokenizer('Hello, my name is', return_tensors='pt').to(model.device)
        outputs = model.generate(**inputs, max_length=20, do_sample=False)
        print(f'Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}')
    else:
        raise
Output
Attempt 1: device_map="auto" with bfloat16
✓ Model loaded successfully
Model dtype: torch.bfloat16
Generated: Hello, my name is [your generated text]

(If OOM on your hardware:)
Attempt 2: 4-bit quantization
✓ Model loaded with 4-bit quantization
Generated: Hello, my name is [your generated text]

What just happened?

The code attempts to load a 7B model using device_map='auto' (which splits the model across GPU and CPU) and bfloat16 (half precision). If that fails with an OOM error, it falls back to 4-bit quantization with BitsAndBytesConfig, which compresses the weights to roughly 3.5-4GB. Whichever attempt succeeds then generates text from a prompt.

Common gotcha

Developers often forget that device_map='auto' only controls where layers are placed, not their precision. Without torch_dtype, the model may still load in float32, depending on your library version's default, wasting memory. Pair device_map='auto' with torch_dtype=torch.bfloat16 explicitly. Also, quantized models are slightly slower at inference: quantization is a memory-speed tradeoff, not a free win.

Error recovery

torch.cuda.OutOfMemoryError: CUDA out of memory
Reduce model precision (float32 → bfloat16 → 4-bit), increase device_map offloading (set device_map='auto'), or reduce batch size and max_length in generation.
ValueError: 'bitsandbytes' is not installed
Run `pip install bitsandbytes`. The setup in this lesson assumes a CUDA-capable GPU; do not expect it to work on a CPU-only system.
RuntimeError: expected scalar type Double but found Float
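Check for a dtype mismatch between your tensors and the model. Token IDs from tokenizer(..., return_tensors='pt') are integer tensors and should stay that way: move them to the model's device, but do not cast them with .to(torch.float32).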
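The precision ladder in the first fix (float32 → bfloat16 → 4-bit) can be written as a loop instead of nested try/except blocks. This is a minimal sketch, not a transformers API: `load_with_fallback` and the two stand-in loaders are hypothetical, and the OOM check reuses the string test from the Code section.

```python
from typing import Any, Callable, Sequence, Tuple


def is_oom(error: RuntimeError) -> bool:
    """Heuristic: CUDA OOM errors mention 'out of memory' in their message."""
    return "out of memory" in str(error).lower()


def load_with_fallback(
    attempts: Sequence[Tuple[str, Callable[[], Any]]],
) -> Tuple[str, Any]:
    """Try each loader in order; move on only after an out-of-memory error."""
    last_error = None
    for label, loader in attempts:
        try:
            return label, loader()
        except RuntimeError as error:
            if not is_oom(error):
                raise  # unrelated failure: do not hide it
            last_error = error
    raise RuntimeError("every loading strategy ran out of memory") from last_error


# Hypothetical stand-ins so the sketch runs without a GPU or a model download.
def fake_bfloat16_loader():
    raise RuntimeError("CUDA out of memory. Tried to allocate more than is free")


def fake_4bit_loader():
    return "quantized-model"


if __name__ == "__main__":
    label, model = load_with_fallback(
        [("bfloat16", fake_bfloat16_loader), ("4-bit", fake_4bit_loader)]
    )
    print(f"Loaded with strategy: {label}")
```

In real use, each loader would be a zero-argument function wrapping AutoModelForCausalLM.from_pretrained with the arguments shown in the Code section, ordered from highest to lowest precision.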

Experienced dev note

In production, profile your actual peak memory use with `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` before and after loading. device_map='auto' is safe for inference, but fine-tuning (training) hits OOM sooner because the backward pass must keep activations in memory; gradient checkpointing reduces that cost in exchange for extra compute. Also: quantized weights are frozen, so you cannot update them directly in transformers 5.5.x without dequantizing first. Adapter methods such as LoRA (via the peft library) work around this by training small extra weights on top of the quantized base model.
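A minimal sketch of that profiling step follows. The helper name `cuda_memory_gib` is an assumption, not a PyTorch function. It reads counters for the current CUDA device only, so a model spread over several GPUs would need one reading per device.

```python
try:
    import torch
except ImportError:  # lets the helper be imported where PyTorch is absent
    torch = None

GIB = 1024 ** 3


def cuda_memory_gib():
    """Return allocated, reserved, and peak-allocated GiB on the current device.

    Returns None when PyTorch or a CUDA device is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated() / GIB,
        "reserved": torch.cuda.memory_reserved() / GIB,
        "peak_allocated": torch.cuda.max_memory_allocated() / GIB,
    }


before = cuda_memory_gib()
# ... load the model here, for example with the from_pretrained call above ...
after = cuda_memory_gib()

if before is not None and after is not None:
    grew = after["allocated"] - before["allocated"]
    print(f"Allocated memory grew by {grew:.2f} GiB")
    print(f"Reserved by the caching allocator: {after['reserved']:.2f} GiB")
```

Allocated memory counts live tensors, while reserved memory also includes blocks the caching allocator is holding for reuse, so the reserved figure is the one to compare against your GPU's capacity.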

Check your understanding

You have a 13B model that runs out of memory on your 24GB GPU with float32. You add bfloat16 and it still OOMs. Why might 4-bit quantization work, and what tradeoff are you accepting?

Show answer hint

A correct answer mentions that bfloat16 cuts weight memory by ~2x relative to float32, but 13B parameters × 2 bytes is still ~26GB for the weights alone, which exceeds 24GB before any activations are counted. 4-bit quantization cuts it by ~8x (~6-7GB), which fits. The tradeoff is slower inference and a slight accuracy loss from quantization; for open-ended text tasks the effect on generation quality is usually small.

VERSION In transformers < 5.0.0, device_map='auto' required the accelerate library to be installed separately. In 5.5.x, it's built-in. Also, BitsAndBytesConfig is the modern pattern; passing load_in_8bit or load_in_4bit directly to from_pretrained is the legacy pattern and is deprecated.
NEXT

Now that your model fits in memory, learn how to batch multiple prompts together efficiently using tokenizer.pad_token_id and attention_mask to avoid OOM on input data.
