Code Advanced medium · 7 min

Hardware requirements for full parameter fine-tuning

What you will learn

Calculate the VRAM, GPU count, and rough training time needed to fine-tune an LLM with all parameters unfrozen.

Why this matters

Most developers underestimate memory requirements and waste weeks on failed training runs or cloud bill shock. Knowing the formula behind the estimate helps prevent both.

Skip if: You don't need this calculation if you're using LoRA, QLoRA, or other parameter-efficient methods: those dramatically reduce memory and let you train on consumer GPUs. This is strictly for updating every weight in the model.

Explanation

What it is: Full parameter fine-tuning means every weight in the model is trainable (no frozen layers). VRAM must hold the weights, their gradients, and the optimizer states simultaneously, roughly 4x the model's base size with AdamW, with activations on top of that.

How it works mechanically: When you set peft_config=None or don't use PEFT at all, the trainer updates all parameters. On each forward pass, PyTorch stores activation tensors for use in the backward pass. Each backward pass produces one gradient value per parameter (the same size as the weights). The AdamW optimizer keeps two states per parameter (momentum and variance), which together take twice the gradient memory. Total VRAM ≈ model_params * 4 bytes * (1 weights + 1 gradients + 2 AdamW states) + activation and batch overhead. At FP32 that is 16 bytes per parameter, so a 7B parameter model needs roughly 112GB for weights + gradients + optimizer state before any activations are counted.
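
As a quick check on the arithmetic, the 112GB figure can be reproduced by counting 16 bytes per parameter at FP32: 4 for the weight, 4 for its gradient, and 8 for the two AdamW states. The sketch below is illustrative only. The function name static_training_memory_gb is made up for this example, it uses decimal gigabytes, and it ignores activations entirely.

```python
def static_training_memory_gb(n_params, bytes_per_value=4):
    """Weights + gradients + AdamW states, in decimal GB (activations excluded)."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value          # one gradient per weight
    adamw_states = 2 * n_params * bytes_per_value   # momentum + variance
    return (weights + gradients + adamw_states) / 1e9


seven_b_fp32 = static_training_memory_gb(7_000_000_000)
print(f"7B at FP32: {seven_b_fp32:.0f} GB")  # 7B at FP32: 112 GB
```

Passing bytes_per_value=2 gives the pure half-precision figure, which is an optimistic lower bound rather than what a mixed-precision trainer usually allocates.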

When to use it: Use full fine-tuning only when you have dedicated enterprise hardware (8x H100s or 16x A100s), your domain requires complete model adaptation, or you're training proprietary models where parameter efficiency matters less than quality. For research or commercial work under budget constraints, use PEFT methods instead.

Analogy

Full fine-tuning is like renovating every room in a house and keeping all construction materials on-site. Parameter-efficient fine-tuning is like hiring a contractor who optimizes what they keep stored and uses scaffolding efficiently.

Code

python

import torch
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"
# Load on CPU so the audit itself consumes no VRAM (needs ~27 GB of system RAM at FP32).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="cpu"
)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Sizes use 1024**3 bytes per "GB" (strictly GiB). FP32 = 4 bytes per value.
base_model_memory_gb = (total_params * 4) / (1024**3)
gradient_memory_gb = (trainable_params * 4) / (1024**3)
# AdamW keeps two FP32 states (momentum + variance) per trainable parameter.
adamw_state_memory_gb = (trainable_params * 8) / (1024**3)

total_memory_fp32_gb = base_model_memory_gb + gradient_memory_gb + adamw_state_memory_gb
# Optimistic: assumes weights, gradients, and AdamW states are all stored in 2 bytes.
# Mixed-precision training usually keeps FP32 copies, so real usage is higher.
total_memory_fp16_gb = base_model_memory_gb / 2 + gradient_memory_gb / 2 + adamw_state_memory_gb / 2

batch_size = 1
seq_length = 2048
hidden_size = model.config.hidden_size  # 4096 for Llama-2-7B
# Lower bound: a single FP32 hidden-state tensor. Real activation memory is far
# larger because every layer stores its own intermediates for the backward pass.
batch_activation_memory_gb = (batch_size * seq_length * hidden_size * 4) / (1024**3)

final_memory_fp32_gb = total_memory_fp32_gb + batch_activation_memory_gb
final_memory_fp16_gb = total_memory_fp16_gb + batch_activation_memory_gb

print(f"Model: {model_name}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"\nMemory breakdown (FP32):")
print(f"  Base model: {base_model_memory_gb:.2f} GB")
print(f"  Gradients: {gradient_memory_gb:.2f} GB")
print(f"  AdamW states: {adamw_state_memory_gb:.2f} GB")
print(f"  Batch activations (bs={batch_size}, seq={seq_length}): {batch_activation_memory_gb:.2f} GB")
print(f"  TOTAL (FP32): {final_memory_fp32_gb:.2f} GB")
print(f"  TOTAL (FP16): {final_memory_fp16_gb:.2f} GB")
print(f"\nGPU recommendation (FP32): {int(final_memory_fp32_gb / 80) + 1}x A100-80GB or {int(final_memory_fp32_gb / 141) + 1}x H200-141GB")
# Placeholder heuristic only: real training time depends on dataset size and throughput, not memory.
print(f"Training time estimate: {(final_memory_fp32_gb / 80) * 10:.1f} GPU-hours per epoch (rough)")

# Free the CPU copy of the model; empty_cache() only matters if CUDA was used.
del model
torch.cuda.empty_cache()

Output

Model: meta-llama/Llama-2-7b-hf
Total parameters: 6,738,415,616
Trainable parameters: 6,738,415,616

Memory breakdown (FP32):
  Base model: 25.10 GB
  Gradients: 25.10 GB
  AdamW states: 50.21 GB
  Batch activations (bs=1, seq=2048): 0.03 GB
  TOTAL (FP32): 100.44 GB
  TOTAL (FP16): 50.24 GB

GPU recommendation (FP32): 2x A100-80GB or 1x H200-141GB
Training time estimate: 12.6 GPU-hours per epoch (rough)

What just happened?

The code loaded the Llama-2-7B model, counted every parameter, then calculated memory usage by multiplying parameter count by 4 bytes (FP32) or 2 bytes (FP16). It then added up the three mandatory memory consumers: the model weights themselves, one gradient tensor per parameter, and two optimizer state tensors per parameter (momentum + variance for AdamW). Finally it estimated the minimum GPU configuration and a rough training time, showing that even a 7B model needs two 80GB A100s for full FP32 fine-tuning, and that is before realistic activation memory is counted.

Common gotcha

Developers often forget that precision is independent of PEFT: a model loaded with torch_dtype=torch.float32 trains in FP32 unless you set fp16=True or bf16=True in your training args, and many engineers only discover this after their training OOMs. Mixed precision mainly shrinks activation memory; the trainer typically keeps FP32 master weights and optimizer states, so expect savings well short of half. Also: batch activation memory scales with batch_size * seq_length * hidden_dim, so a batch_size of 16 makes activation memory 16x larger, which is why gradient accumulation steps are essential on smaller GPUs.
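
To make the batch-size scaling concrete, here is a rough sketch of the batch_size * seq_length * hidden_dim rule. The function name is hypothetical, the 32-layer count is an assumption about Llama-2-7B, and keeping exactly one FP32 hidden-state tensor per layer is a deliberate simplification: real activation memory is higher because attention and MLP intermediates are stored too.

```python
def hidden_state_gib(batch_size, seq_length, hidden_dim, n_layers=1, bytes_per_value=4):
    """Memory for one hidden-state tensor per layer, in GiB (a lower bound)."""
    n_values = batch_size * seq_length * hidden_dim * n_layers
    return n_values * bytes_per_value / 1024**3


for bs in (1, 4, 16):
    # Assumed shape: seq_length=2048, hidden_dim=4096, 32 layers.
    print(f"batch_size={bs:>2}: {hidden_state_gib(bs, 2048, 4096, n_layers=32):.1f} GiB")
# batch_size= 1: 1.0 GiB
# batch_size= 4: 4.0 GiB
# batch_size=16: 16.0 GiB
```

Even this lower bound grows by 16x between batch_size=1 and batch_size=16, while the weights, gradients, and optimizer states stay fixed.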

Error recovery

torch.cuda.OutOfMemoryError

You calculated memory wrong or your GPU is smaller than needed. Reduce batch_size to 1, enable fp16=True in SFTConfig, or switch to LoRA. If training still OOMs, your hardware cannot support full fine-tuning for this model size.

RuntimeError: CUDA out of memory

Same root cause as above. Set gradient_checkpointing=True in your training args (it recomputes activations during the backward pass instead of storing them all, trading extra compute for memory) and set gradient_accumulation_steps > 1 to simulate larger batches without larger memory spikes.
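
The memory-saving settings named in this section are usually combined. Below is a hedged sketch of the keyword arguments, assuming they are later passed to SFTConfig or TrainingArguments. The dict name is hypothetical and the values are illustrative, not recommendations.

```python
# Illustrative values only; tune them for your own hardware and dataset.
memory_saving_kwargs = {
    "per_device_train_batch_size": 1,   # smallest micro-batch, lowest activation memory
    "gradient_accumulation_steps": 16,  # accumulate 16 micro-batches per optimizer step
    "gradient_checkpointing": True,     # recompute activations in backward instead of storing them
    "fp16": True,                       # mixed precision; bf16=True is an alternative on newer GPUs
}

effective_batch_size = (
    memory_saving_kwargs["per_device_train_batch_size"]
    * memory_saving_kwargs["gradient_accumulation_steps"]
)
print(f"Effective batch size per GPU: {effective_batch_size}")  # Effective batch size per GPU: 16

# Assumed usage (not run here): SFTConfig(output_dir="out", **memory_saving_kwargs)
```

Keeping the settings in a plain dict makes it easy to log the exact configuration next to the measured peak memory from a short dry run.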

TypeError: unsupported operand type(s)

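A TypeError is a type mismatch in an expression rather than a memory problem, so check the failing line first. Also confirm the model is not still loaded with device_map='cpu': use device_map='cuda:0' or 'auto' to place the model on the GPU before training. Otherwise all computation happens on the CPU, where VRAM footprint is irrelevant but training is dramatically slower.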

Experienced dev note

The real production surprise: total memory doesn't scale proportionally with batch size or sequence length. Weights, gradients, and optimizer states are a fixed cost, activations grow with every extra token in the batch, and activation checkpointing and mixed precision interact in non-obvious ways with your specific hardware. A 7B model that fits on 2x A100s with batch_size=1 might need 4x A100s at batch_size=4. Before committing to a multi-day training run, always do a 10-step dry run with your exact config and watch nvidia-smi. Also: inter-GPU communication overhead on multi-GPU setups means each additional GPU adds less than one GPU's worth of throughput (roughly 0.85x as a rule of thumb), so scaling beyond 4 GPUs gives diminishing returns.

Check your understanding

You calculated that a 7B model needs roughly 100GB for full FP32 fine-tuning but only about 50GB for FP16. If you use gradient accumulation with accumulation_steps=4 and batch_size=1, does your VRAM requirement change? Why or why not?

Show answer hint

A correct answer recognizes that gradient accumulation does NOT change peak VRAM for model + gradients + optimizer states. Peak memory still includes the full model, full gradients, and full optimizer states in VRAM simultaneously, and activation memory is set by batch_size=1, not by the effective batch of 4. Accumulation steps only affect how many forward-backward cycles happen before a weight update, not the size of the tensors stored; the cost is more cycles per update.
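
A toy illustration of why this holds, using a hypothetical one-weight linear model in plain Python rather than a real trainer: summing scaled gradients over four micro-batches of size 1 gives the same update as one batch of size 4, while only one sample is processed at a time and the gradient buffer stays the size of the parameter.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# One batch of 4: all four samples are processed together.
full_batch_grad = grad_mse(w, xs, ys)

# accumulation_steps=4 with batch_size=1: one sample at a time, each
# micro-batch gradient scaled by 1/accumulation_steps before summing.
accumulation_steps = 4
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    accumulated_grad += grad_mse(w, [x], [y]) / accumulation_steps

print(full_batch_grad, accumulated_grad)  # -22.5 -22.5
```

The accumulated value occupies the same single slot throughout the loop, which mirrors how a real gradient buffer is reused across micro-batches.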

VERSION transformers>=5.4.0 changed AutoModel.from_pretrained to use device_map='auto' by default, which can cause unexpected memory behavior if you have limited VRAM. Explicitly specify device_map='cpu' during the calculation so the memory audit itself does not place the model on the GPU and consume VRAM.

Now that you know the memory ceiling for full fine-tuning, learn how to actually run SFTTrainer on your constrained hardware using gradient checkpointing and mixed precision to fit models that nominally require more VRAM than you have.
