Hardware requirements for full parameter fine-tuning
Why this matters
Most developers underestimate memory requirements and waste weeks on failed training runs or cloud bill shock. Understanding the exact formula prevents both.
Explanation
What it is: Full parameter fine-tuning means every weight in the model is trainable (no frozen layers). This requires storing activations, gradients, and optimizer states in VRAM simultaneously: roughly 4x the model's base size in memory.
How it works mechanically: When you set peft_config=None or don't use PEFT at all, the trainer updates all parameters. For each forward pass, PyTorch stores activation tensors. Each backward pass creates gradient tensors (same size as parameters). AdamW optimizer maintains two states per parameter (momentum and variance), doubling gradient size. Total VRAM ≈ model_params * 4 bytes * (1 forward activations + 2 AdamW states) + batch overhead. A 7B parameter model needs roughly 112GB just for the base model + gradients + optimizer state at fp32 precision.
When to use it: Use full fine-tuning only when you have dedicated enterprise hardware (8x H100s or 16x A100s), your domain requires complete model adaptation, or you're training proprietary models where parameter efficiency matters less than quality. For research or commercial work under budget constraints, use PEFT methods instead.
Analogy
Full fine-tuning is like renovating every room in a house and keeping all construction materials on-site. Parameter-efficient fine-tuning is like hiring a contractor who optimizes what they keep stored and uses scaffolding efficiently.
Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float32,
device_map="cpu"
)
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
base_model_memory_gb = (total_params * 4) / (1024**3)
gradient_memory_gb = (trainable_params * 4) / (1024**3)
adamw_state_memory_gb = (trainable_params * 8) / (1024**3)
total_memory_fp32_gb = base_model_memory_gb + gradient_memory_gb + adamw_state_memory_gb
total_memory_fp16_gb = base_model_memory_gb / 2 + gradient_memory_gb / 2 + adamw_state_memory_gb / 2
batch_size = 1
seq_length = 2048
batch_activation_memory_gb = (batch_size * seq_length * 4096 * 4) / (1024**3)
final_memory_fp32_gb = total_memory_fp32_gb + batch_activation_memory_gb
final_memory_fp16_gb = total_memory_fp16_gb + batch_activation_memory_gb
print(f"Model: {model_name}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"\nMemory breakdown (FP32):")
print(f" Base model: {base_model_memory_gb:.2f} GB")
print(f" Gradients: {gradient_memory_gb:.2f} GB")
print(f" AdamW states: {adamw_state_memory_gb:.2f} GB")
print(f" Batch activations (bs={batch_size}, seq={seq_length}): {batch_activation_memory_gb:.2f} GB")
print(f" TOTAL (FP32): {final_memory_fp32_gb:.2f} GB")
print(f" TOTAL (FP16): {final_memory_fp16_gb:.2f} GB")
print(f"\nGPU recommendation (FP32): {int(final_memory_fp32_gb / 80) + 1}x A100-80GB or {int(final_memory_fp32_gb / 141) + 1}x H100-141GB")
print(f"Training time estimate: {(final_memory_fp32_gb / 80) * 10} GPU-hours per epoch (rough)")
del model
torch.cuda.empty_cache() Model: meta-llama/Llama-2-7b-hf Total parameters: 6,738,415,616 Trainable parameters: 6,738,415,616 Memory breakdown (FP32): Base model: 25.72 GB Gradients: 25.72 GB AdamW states: 51.43 GB Batch activations (bs=1, seq=2048): 0.03 GB TOTAL (FP32): 102.91 GB TOTAL (FP16): 51.46 GB GPU recommendation (FP32): 2x A100-80GB or 1x H100-141GB Training time estimate: 12.863749999999999 GPU-hours per epoch (rough)
What just happened?
The code loaded the Llama-2-7B model, counted every parameter, then calculated memory usage by multiplying parameter count by 4 bytes (FP32) or 2 bytes (FP16). It then added up the three mandatory memory consumers: the model weights themselves, one gradient tensor per parameter, and two optimizer state tensors per parameter (momentum + variance for AdamW). Finally it estimated the minimum GPU configuration and rough training time, showing that even a 7B model requires dual A100s for full fine-tuning.
Common gotcha
Developers often forget that trainer.train() with peft_config=None silently uses FP32 by default, not FP16. Setting fp16=True in your training args cuts memory in half, but many engineers only discover this after their training OOMs. Also: batch activation memory scales with batch_size * seq_length * hidden_dim: a batch_size of 16 makes activation memory 16x larger, which is why gradient accumulation steps are essential on smaller GPUs.
Error recovery
torch.cuda.OutOfMemoryErrorRuntimeError: CUDA out of memoryTypeError: unsupported operand type(s)Experienced dev note
The real production surprise: memory doesn't scale linearly with batch size or sequence length: activation checkpointing and mixed precision interact in non-obvious ways with your specific hardware. A 7B model that fits on 2x A100s with batch_size=1 might need 4x A100s at batch_size=4, not 2.25x. Before committing to a multi-day training run, always do a 10-step dry run with your exact config and watch nvidia-smi. Also: inter-GPU communication overhead on multi-GPU setups means you gain roughly 0.85x speedup per additional GPU, not 1.0x, so scaling beyond 4 GPUs gives diminishing returns.
Check your understanding
You calculated that a 7B model needs 102GB for full FP32 fine-tuning but only 51GB for FP16. If you use gradient accumulation with accumulation_steps=4 and batch_size=1, does your VRAM requirement change? Why or why not?
Show answer hint
Correct answer recognizes that gradient accumulation does NOT reduce peak VRAM for model + gradients + optimizer states: it only reduces throughput. Peak memory still includes the full model, full gradients, and full optimizer states in VRAM simultaneously. Accumulation steps only affect how many forward-backward cycles happen before a weight update, not the size of intermediate tensors stored.