GPU memory requirements: calculating VRAM for your model size
Why this matters
Fine-tuning LLMs consumes exponentially more VRAM than inference. Running out of VRAM mid-training wastes hours and forces you to restart with smaller batch sizes or quantization. Calculating requirements upfront lets you choose the right hardware (or adjust hyperparameters) before committing to a training run.
Explanation
GPU memory during fine-tuning scales with four factors: model size, batch size, sequence length, and precision. Unlike inference (which is mostly model weights), training stores activations for backprop, optimizer states (Adam uses 2x model size), and gradients. A 7B model in float32 needs ~28GB just for weights, but training the same model can demand 80-120GB depending on batch size and optimizer.
The calculation is roughly: Model params × (precision bytes) × (1 + 2×optimizer_factor + 0.05×activations) × batch_size / gradient_accumulation. For LoRA fine-tuning, you only train adapter weights (1-5% of model size), dramatically reducing VRAM. torch.cuda.memory_allocated() tells you actual usage; torch.cuda.get_device_properties() tells you GPU capacity.
In practice: Start with a dry run on 1-2 training steps with your target batch size, measure peak VRAM, and extrapolate. Tools like `memory_profiler` or the SFTTrainer's built-in memory tracking automate this.
Analogy
Calculating VRAM is like checking your truck bed before loading. The model weights are the truck's frame (doesn't change), but during fine-tuning, activations and gradients are boxes stacked during the work: they take up space only while you're actively moving things. Stop, measure the peak pile, then decide if you need a bigger truck or fewer boxes per trip (batch size).
Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import warnings
warnings.filterwarnings('ignore')
model_name = 'gpt2'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device == 'cuda':
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
print(f'GPU: {torch.cuda.get_device_name(0)}')
print(f'Total VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB\n')
print('=== MEMORY CALCULATION ===')
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cpu')
tokenizer = AutoTokenizer.from_pretrained(model_name)
total_params = sum(p.numel() for p in model.parameters())
print(f'Model: {model_name}')
print(f'Total parameters: {total_params / 1e6:.1f}M')
model_size_gb = total_params * 4 / 1e9
print(f'Model weights (float32): {model_size_gb:.2f} GB')
model = model.to(device)
torch.cuda.reset_peak_memory_stats()
model_allocated = torch.cuda.memory_allocated(0) / 1e9
print(f'Actual allocated after load: {model_allocated:.2f} GB')
print('\n=== TRAINING MEMORY ESTIMATE ===')
batch_size = 4
seq_length = 512
activations_gb = (batch_size * seq_length * 768 * 4) / 1e9
optimizer_state_gb = model_size_gb * 2
gradients_gb = model_size_gb
total_training_estimate = model_allocated + activations_gb + optimizer_state_gb + gradients_gb
print(f'Batch size: {batch_size}, Seq length: {seq_length}')
print(f' Activations (approx): {activations_gb:.2f} GB')
print(f' Optimizer state (Adam 2x): {optimizer_state_gb:.2f} GB')
print(f' Gradients: {gradients_gb:.2f} GB')
print(f' Total estimate: {total_training_estimate:.2f} GB')
print('\n=== ACTUAL MEMORY DURING FORWARD+BACKWARD ===')
tokenizer.pad_token = tokenizer.eos_token
dummy_input = tokenizer(
['The quick brown fox'] * batch_size,
return_tensors='pt',
padding=True,
truncation=True,
max_length=seq_length
).to(device)
torch.cuda.reset_peak_memory_stats()
outputs = model(**dummy_input, labels=dummy_input['input_ids'])
loss = outputs.loss
loss.backward()
peak_memory = torch.cuda.max_memory_allocated(0) / 1e9
print(f'Peak VRAM during one backward pass: {peak_memory:.2f} GB')
print(f'Estimate vs actual: {total_training_estimate:.2f} GB (est) vs {peak_memory:.2f} GB (actual)')
print(f'Safe headroom needed: +10-20% for safety margin')
torch.cuda.empty_cache() GPU: NVIDIA A100-SXM4-40GB Total VRAM: 40.00 GB === MEMORY CALCULATION === Model: gpt2 Total parameters: 124.4M Model weights (float32): 0.50 GB Actual allocated after load: 0.53 GB === TRAINING MEMORY ESTIMATE === Batch size: 4, Seq length: 512 Activations (approx): 0.00 GB Optimizer state (Adam 2x): 1.00 GB Gradients: 0.50 GB Total estimate: 2.03 GB === ACTUAL MEMORY DURING FORWARD+BACKWARD === Peak VRAM during one backward pass: 1.87 GB Estimate vs actual: 2.03 GB (est) vs 1.87 GB (actual) Safe headroom needed: +10-20% for safety margin
What just happened?
The code loaded GPT-2 (124M params), calculated theoretical VRAM needed for training (weights + optimizer states + activations + gradients), then ran a real forward and backward pass to measure actual peak VRAM consumption. The estimate (2.03 GB) closely matched the actual measurement (1.87 GB), showing the formula works. Then it cleared cache to reset state.
Common gotcha
Developers often only count model weights and forget that Adam optimizer stores two copies of every parameter (momentum and variance). This alone doubles VRAM. On top of that, `loss.backward()` stores activations from the forward pass in memory. Many people run out of VRAM not because of model size, but because their optimizer states + batch size combination exceeds capacity. Also, `torch.cuda.memory_allocated()` returns allocated memory, not reserved memory: reserved memory (via `.memory_reserved()`) is often 20-40% higher and is what actually limits your GPU.
Error recovery
RuntimeError: CUDA out of memoryAttributeError: 'NoneType' object has no attribute 'to'IndexError when calling torch.cuda.memory_allocated(0)AssertionError in backward passExperienced dev note
The real insight: **always run a 2-step dry run on your target hardware before committing to an 8-hour fine-tune.** Peak memory doesn't appear at step 1: it stabilizes by step 2-3 after all layers are materialized in memory. Measuring just step 1 underestimates. Also, if you're within 2-3GB of your GPU limit, you're not safe: reserve that margin for wandb logging, generation sampling during eval, and OS overhead. On 24GB GPUs, anything over 20GB training-only allocation is risky. Finally, `torch.cuda.memory_reserved()` is what actually matters for whether your job runs: `memory_allocated()` is just what's currently in use within the reserved pool.
Check your understanding
You have a 13B model and want to fine-tune it on an 80GB GPU with batch_size=8 using Adam optimizer. Your calculation estimates 75GB peak VRAM. The job crashes with OOM at step 1,500. What are two independent changes you could make that would NOT require buying a bigger GPU, and why would each work?
Show answer hint
A correct answer identifies two from: (1) reduce batch_size to 4 (halves activation memory), (2) enable gradient_accumulation_steps=2 with batch_size=4 (simulates batch_size=8 without storing all activations at once), (3) enable LoRA with a small rank (trains only 1-2% of weights, eliminates most of the optimizer state memory), (4) use 8-bit quantization (model weights become 1/8 original size). The key insight is that batch_size and optimizer_state are the two largest contributors after model weights, and you can trade off training efficiency (slower convergence, fewer true gradients per step) for memory.