RuntimeError: CUDA out of memory
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X.XX GiB
Stack trace
RuntimeError: CUDA out of memory. Tried to allocate 14.00 GiB (GPU 0; 24.00 GiB total capacity; 8.50 GiB already allocated; 4.20 GiB free; 8.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
File "/path/to/site-packages/transformers/modeling_utils.py", line 2675, in _load_state_dict_into_model
model.load_state_dict(state_dict, _fast_load=False)
File "/path/to/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
_load_checkpoint(self, state_dict, strict, load_state_dict_impl)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 GiB (GPU 0; 24.00 GiB total capacity; 8.50 GiB already allocated; 4.20 GiB free; 8.00 GiB reserved in total by PyTorch)
Why it happens
Llama models are large (up to 70B+ parameters). Loading Llama 3.1 70B without quantization requires about 140 GB of VRAM for the weights in float16 (2 bytes per parameter) or about 280 GB in float32 (4 bytes per parameter), far exceeding single-GPU capacity (4–80 GB). When the full set of weights is placed on one GPU without quantization or sharding, the allocation exceeds the available VRAM and CUDA raises an out-of-memory error. This is especially common on smaller GPUs (RTX 3090, RTX 4080) or with larger variants (Llama 3.3 70B) loaded without memory optimization.
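The figures above follow from simple arithmetic on the weights alone. A minimal sketch, assuming a round 70 billion parameters and ignoring activations, the KV cache, and framework overhead (the table and function names are illustrative, not part of any library):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def weights_gb(param_count: float, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GB (1e9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


# Print the weight footprint of a 70B-parameter model at each precision.
for dtype in BYTES_PER_PARAM:
    print(f"70B @ {dtype}: {weights_gb(70e9, dtype):.0f} GB")
```

Real usage is higher than these numbers, so treat them as a lower bound when comparing against a GPU's capacity.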
Detection
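Check GPU memory before loading the model: `torch.cuda.get_device_properties(0).total_memory` reports total capacity, `torch.cuda.memory_allocated()` reports what PyTorch tensors currently occupy, and `torch.cuda.mem_get_info()` reports free and total bytes on the device. Compare free VRAM to the model size (parameter count × bytes per parameter). Wrap model loading in try/except and log the memory figures so an OOM is reported clearly rather than crashing the process. Setting `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` in the environment does not detect OOM by itself; it only reduces fragmentation, which helps rule that out as the cause.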
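One way to combine the pre-check and the try/except is sketched below. The helper names (`free_vram_bytes`, `load_with_oom_report`) and the logger name are hypothetical; the sketch assumes `load_fn` is a zero-argument callable that performs the actual `from_pretrained` call, and it degrades to a no-op check when torch or CUDA is unavailable.

```python
import logging

logger = logging.getLogger("oom_check")  # hypothetical logger name


def free_vram_bytes(device: int = 0):
    """Return free bytes on a CUDA device, or None if torch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes


def load_with_oom_report(load_fn, required_bytes: int):
    """Warn if the model is unlikely to fit, then load; return None on CUDA OOM."""
    free = free_vram_bytes()
    if free is not None and required_bytes > free:
        logger.warning("Need ~%d bytes but only %d are free", required_bytes, free)
    try:
        return load_fn()
    except RuntimeError as err:
        # torch.cuda.OutOfMemoryError is a RuntimeError subclass.
        if "out of memory" not in str(err).lower():
            raise
        logger.error("CUDA OOM while loading: %s", err)
        return None
```

Returning `None` instead of re-raising is a design choice for this sketch; a service might prefer to re-raise after logging so that a supervisor can restart with a smaller configuration.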
Causes & fixes
Loading full model in float32 or float16 without quantization on undersized GPU
Use 4-bit or 8-bit quantization via BitsAndBytesConfig: `quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)` and pass to `AutoModelForCausalLM.from_pretrained(..., quantization_config=quantization_config, device_map='auto')`
Not using device_map='auto' to spread layers across devices or offload them
Add `device_map='auto'` to from_pretrained(): this lets Accelerate place layers across the available GPUs and offload whatever does not fit to CPU RAM. Note that this is layer-wise placement, not tensor parallelism: `AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.3-70B-Instruct', device_map='auto', torch_dtype=torch.float16)`
Loading a model too large for available GPU(s) even with quantization (e.g., 70B on single 24GB GPU)
Switch to a smaller model (Llama 3.2 3B or Llama 3.1 8B instead of 70B), or run a quantized build through vLLM or Ollama, which manage memory more efficiently
GPU memory fragmentation or other PyTorch processes hogging VRAM
Set the `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` env var before importing torch, release PyTorch's own cached blocks with `torch.cuda.empty_cache()`, and make sure no other GPU jobs are running (`empty_cache()` cannot free memory held by other processes)
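The fragmentation fix in the last item can be sketched as follows. The helper name `release_cached_memory` is hypothetical, and the torch import is guarded so the snippet also runs on a machine without torch or CUDA.

```python
import gc
import os

# Set the allocator option before torch is imported anywhere in the process.
# setdefault keeps any value the user has already exported.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")


def release_cached_memory() -> bool:
    """Collect unreferenced objects, then release PyTorch's cached CUDA blocks.

    Returns False when torch or CUDA is unavailable, True otherwise.
    """
    gc.collect()  # drop dead tensors so their blocks become releasable
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True
```

Calling `release_cached_memory()` only returns memory that this process no longer references; tensors that are still alive, and memory held by other processes, are unaffected.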
Code: broken vs fixed
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# This will fail with CUDA OOM on most GPUs
model_name = 'meta-llama/Llama-3.3-70B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16  # ← No quantization, loads full 140GB
)  # ← RuntimeError: CUDA out of memory

# Fixed version
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# Set up 4-bit quantization (about 4x less weight memory than float16)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
model_name = 'meta-llama/Llama-3.3-70B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # ← Added: 4-bit quantization
    device_map='auto',  # ← Added: intelligent sharding across GPUs/CPU
    torch_dtype=torch.float16,
    token=os.environ.get('HF_TOKEN')  # ← Use env var for auth
)
print(f'Model loaded successfully. Device map: {model.hf_device_map}')
print(f'GPU memory used: {torch.cuda.memory_allocated() / 1e9:.2f}GB')
Workaround
If BitsAndBytes is not available, use Ollama (`ollama pull llama3.2:latest`), which handles quantization and memory internally, or switch to a smaller model such as Llama 3.2 3B (`meta-llama/Llama-3.2-3B-Instruct`), whose float16 weights take roughly 6–7 GB and so fit on an 8 GB GPU without quantization.
Prevention
Estimate memory requirements before loading: compare param_count × bytes_per_param with available VRAM, and leave headroom for activations and the KV cache. Set a GPU memory budget: use smaller models in production on limited VRAM (Llama 3.2 1B/3B for edge devices), quantize to 4-bit when a large model (70B+) must run locally, and use API-based inference (Groq, Together AI) for models that cannot fit. Automate device selection: check GPU count and free memory, quantize when the unquantized weights do not fit, and fall back to CPU when no GPU is present.
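The device-selection rule described above could be automated along these lines. This is a sketch under stated assumptions: `choose_plan` and its return labels are hypothetical, the 0.8 headroom factor is an arbitrary allowance for activations and the KV cache, and summing free memory across GPUs assumes the weights are sharded with something like `device_map='auto'`.

```python
def choose_plan(param_count: float, free_vram_gb: list, headroom: float = 0.8) -> str:
    """Pick the highest precision whose weights fit in the usable VRAM budget.

    free_vram_gb holds the free memory of each visible GPU in GB (1e9 bytes).
    """
    if not free_vram_gb:
        return "cpu-or-api"  # no GPU present: run on CPU or call a hosted API
    budget_gb = sum(free_vram_gb) * headroom
    # Try precisions from largest to smallest footprint.
    for label, bytes_per_param in (("float16", 2.0), ("int8", 1.0), ("4bit", 0.5)):
        if param_count * bytes_per_param / 1e9 <= budget_gb:
            return label
    return "smaller-model-or-api"  # even 4-bit weights do not fit


print(choose_plan(70e9, [24.0]))        # single 24 GB GPU, 70B model
print(choose_plan(8e9, [24.0]))         # single 24 GB GPU, 8B model
print(choose_plan(70e9, [80.0, 80.0]))  # two 80 GB GPUs, 70B model
```

The returned label can then drive whether a `BitsAndBytesConfig` is passed to `from_pretrained`, or whether the job is redirected to a smaller checkpoint or an API.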