High severity · Intermediate · Fix: 5-15 min

RuntimeError: CUDA out of memory

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X.XX GiB

What this error means

PyTorch/Transformers cannot allocate GPU memory to load the Llama model weights because total model size exceeds available VRAM on your GPU(s).

Stack trace

traceback

RuntimeError: CUDA out of memory. Tried to allocate 14.00 GiB (GPU 0; 24.00 GiB total capacity; 8.50 GiB already allocated; 4.20 GiB free; 8.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

File "/path/to/site-packages/transformers/modeling_utils.py", line 2675, in _load_state_dict_into_model
    model.load_state_dict(state_dict, _fast_load=False)

File "/path/to/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    _load_checkpoint(self, state_dict, strict, load_state_dict_impl)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 GiB (GPU 0; 24.00 GiB total capacity; 8.50 GiB already allocated; 4.20 GiB free; 8.00 GiB reserved in total by PyTorch)

QUICK FIX

Add `quantization_config=BitsAndBytesConfig(load_in_4bit=True)` and `device_map='auto'` to `AutoModelForCausalLM.from_pretrained()` to cut the weight memory footprint to roughly a quarter of its float16 size.

Why it happens

Llama models are large (7B–70B+ parameters). The weights of a 70B model such as Llama 3.1 70B take about 140 GB in float16 (2 bytes per parameter) and about 280 GB in float32 (4 bytes per parameter), far more than a single GPU offers (typically 8–24 GB on consumer cards and 80 GB on data-center cards such as the A100 or H100). When the full model is placed on the GPU without quantization or sharding, PyTorch has to allocate every weight tensor on one device, and the allocation fails once VRAM runs out. This is especially common on 16–24 GB consumer GPUs (RTX 4080, RTX 3090) or when loading larger variants (Llama 3.3 70B) without memory optimization.
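The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate of weight storage only, assuming 1 GB = 10^9 bytes; the helper name `weight_size_gb` is hypothetical, and activations, the KV cache, and quantization overhead are ignored.

```python
def weight_size_gb(param_count: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return param_count * bits_per_param / 8 / 1e9


# Weight footprint of a 70B-parameter model at common precisions.
for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"70B in {label}: ~{weight_size_gb(70e9, bits):.0f} GB")
```

Real usage will be higher than these numbers because inference also needs memory for activations and the KV cache.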

Detection

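Check GPU memory before loading the model: `torch.cuda.get_device_properties(0).total_memory` reports total VRAM and `torch.cuda.memory_allocated()` reports what PyTorch has already allocated. Compare available VRAM to the estimated model size (parameter count × bytes per parameter). Wrap model loading in try/except for `torch.cuda.OutOfMemoryError` so the failure is logged with context instead of crashing the process. Setting `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` in the environment reduces fragmentation, but it does not detect OOM by itself.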
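A pre-flight check along these lines might look as follows. It is a minimal sketch: the helper names `free_vram_bytes` and `fits_in_vram` are hypothetical, and the 20% headroom for activations and the KV cache is an assumed safety margin, not a measured value. `torch` is imported lazily so the comparison helper can be used without a GPU.

```python
def free_vram_bytes(device: int = 0) -> int:
    """Free memory on a CUDA device in bytes, or 0 if CUDA is unavailable."""
    import torch  # imported lazily so fits_in_vram works without torch installed

    if not torch.cuda.is_available():
        return 0
    free, _total = torch.cuda.mem_get_info(device)  # returns (free, total) in bytes
    return free


def fits_in_vram(param_count: float, bytes_per_param: float,
                 free_bytes: int, headroom: float = 0.2) -> bool:
    """True if the weights plus a safety margin fit into free_bytes.

    headroom=0.2 is an assumed margin for activations and the KV cache.
    """
    needed = param_count * bytes_per_param * (1 + headroom)
    return needed <= free_bytes
```

Calling `fits_in_vram(70e9, 2, free_vram_bytes())` before `from_pretrained()` gives an early warning for a float16 70B load instead of a crash partway through loading.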

Causes & fixes

Loading full model in float32 or float16 without quantization on undersized GPU

✓ Fix

Use 4-bit or 8-bit quantization via BitsAndBytesConfig: `quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)` and pass to `AutoModelForCausalLM.from_pretrained(..., quantization_config=quantization_config, device_map='auto')`

Not using device_map='auto' to shard layers across GPUs or offload them to CPU

✓ Fix

Add `device_map='auto'` to from_pretrained(). This places the model's layers across the available GPUs and offloads the remainder to CPU when they do not fit: `AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.3-70B-Instruct', device_map='auto', torch_dtype=torch.float16)`
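When `device_map='auto'` is used, `from_pretrained()` also accepts a `max_memory` mapping that caps how much of each device may be filled, which can leave room for activations. The sketch below builds such a mapping; the helper name `build_max_memory`, the 2 GiB per-GPU reserve, and the 64 GiB CPU budget are illustrative assumptions to adjust for your machine.

```python
def build_max_memory(gpu_total_gib, reserve_gib=2, cpu_gib=64):
    """Build a max_memory mapping: GPU index -> budget string, plus a 'cpu' entry.

    reserve_gib and cpu_gib are assumed defaults, not recommended values.
    """
    budget = {
        index: f"{max(int(total) - reserve_gib, 0)}GiB"
        for index, total in enumerate(gpu_total_gib)
    }
    budget["cpu"] = f"{cpu_gib}GiB"
    return budget


print(build_max_memory([24, 24]))

# Intended use (not executed here):
# AutoModelForCausalLM.from_pretrained(
#     model_name, device_map='auto', max_memory=build_max_memory([24, 24]))
```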

Loading a model too large for available GPU(s) even with quantization (e.g., 70B on single 24GB GPU)

✓ Fix

Switch to a smaller model variant (Llama 3.2 3B or Llama 3.1 8B instead of 70B), or run a quantized build through vLLM or Ollama, which manage memory more efficiently

GPU memory fragmentation or other PyTorch processes hogging VRAM

✓ Fix

Set the `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` environment variable before importing torch, release cached blocks with `torch.cuda.empty_cache()`, and check `nvidia-smi` to make sure no other process is holding VRAM
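The fragmentation fix can be sketched as below. The function name `release_cached_vram` is hypothetical; the environment variable and the `torch.cuda.empty_cache()` call are the ones named in the fix. Note that `empty_cache()` only returns unused cached blocks, so tensors that are still referenced keep their memory.

```python
import gc
import os

# Set this before `import torch` so the CUDA allocator picks it up.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")


def release_cached_vram() -> None:
    """Collect unreachable Python objects, then release PyTorch's unused cached blocks."""
    import torch  # imported after the environment variable is set

    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

Call `release_cached_vram()` after deleting a previous model (`del model`) and before loading the next one.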

Code: broken vs fixed

Broken - triggers the error

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# This will fail with CUDA OOM on most GPUs
model_name = 'meta-llama/Llama-3.3-70B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16  # ← No quantization: ~140 GB of float16 weights
).to('cuda')  # ← Placing the full model on one GPU raises CUDA out of memory

Fixed - works correctly

python

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Set up 4-bit quantization to reduce memory 4x
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)

model_name = 'meta-llama/Llama-3.3-70B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # ← Added: 4-bit quantization
    device_map='auto',  # ← Added: intelligent sharding across GPUs/CPU
    torch_dtype=torch.float16,
    token=os.environ.get('HF_TOKEN')  # ← Use env var for auth
)

print(f'Model loaded successfully. Device map: {model.hf_device_map}')
print(f'GPU memory used: {torch.cuda.memory_allocated() / 1e9:.2f}GB')

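With `load_in_4bit`, BitsAndBytesConfig cuts the weight footprint to roughly a quarter of float16 (about 140 GB → 35 GB for a 70B model), and `device_map='auto'` distributes the layers across the available GPUs. About 35 GB still exceeds a single 24 GB card, so this configuration needs roughly 40 GB or more of combined VRAM, for example two 24 GB GPUs or one 48 GB or 80 GB GPU.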

⚠ Workaround

If bitsandbytes is not available, use Ollama (`ollama pull llama3.2:latest`), which handles quantization and memory internally, or switch to a smaller model such as Llama 3.2 3B (`meta-llama/Llama-3.2-3B-Instruct`), whose float16 weights take about 6.5 GB and fit on an 8 GB GPU without quantization.

✓ Prevention

Estimate model memory requirements before loading: compare (param_count × bytes_per_param) against available VRAM. Establish a GPU memory budget: use smaller models in production on limited VRAM (Llama 3.2 1B/3B for edge devices), apply 4-bit quantization to large models (70B+), and use API-based inference (Groq, Together AI) for models you cannot fit locally. Automate device selection: check GPU count and capacity, quantize if needed, and fall back to CPU if no GPU is present.
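The device-selection step could be automated with a small decision ladder such as the one below. It is a sketch under stated assumptions: the function name `pick_loading_strategy` and its return labels are hypothetical, 4-bit weights are approximated as 0.5 bytes per parameter, and 20% of VRAM is held back for activations and the KV cache.

```python
def pick_loading_strategy(param_count: float, gpu_free_gb: list) -> str:
    """Choose a loading strategy from the parameter count and free VRAM per GPU (in GB)."""
    if not gpu_free_gb:
        return "cpu-or-api"  # no GPU present

    usable_gb = sum(gpu_free_gb) * 0.8       # assumption: keep 20% as headroom
    fp16_gb = param_count * 2 / 1e9          # 2 bytes per parameter
    int4_gb = param_count * 0.5 / 1e9        # ~0.5 bytes per parameter

    if fp16_gb <= usable_gb:
        return "float16"
    if int4_gb <= usable_gb:
        return "4-bit"
    return "smaller-model-or-api"
```

For example, a 70B model with two GPUs reporting 24 GB free each maps to "4-bit", while the same model on a single 24 GB GPU maps to "smaller-model-or-api".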

Python 3.9+ · transformers >=4.30.0 · tested on 4.45.0+

Verified 2026-04 · llama-3.3-70B-Instruct, llama-3.2-3B-Instruct, llama-3.2-1B-Instruct
