Full fine-tuning vs LoRA decision
Why this matters
This decision directly impacts your budget, deployment speed, and whether you can even run training on your hardware. A wrong choice wastes weeks of iteration or runs out of VRAM mid-training. Senior teams make this call upfront based on data, not guessing.
Explanation
Full fine-tuning updates every parameter in the model during training. LoRA (Low-Rank Adaptation) freezes the original model and trains only small rank decomposition matrices, reducing trainable parameters from billions to millions. Mechanically, LoRA injects trainable A and B matrices into each layer: the layer output becomes h = W₀x + (B·A)x, where W₀ is frozen. Full fine-tuning requires storing gradients for every parameter; LoRA only stores gradients for A and B. The decision hinges on three axes: memory available (LoRA uses 5-10x less VRAM), final accuracy needed (full fine-tuning is usually 1-3% better on downstream tasks), and inference speed (LoRA adds latency if you don't merge weights back). For most practitioners below 80B parameters with limited GPU memory, LoRA wins. For specialized domains where every 0.5% accuracy matters and you have 8+ A100s, full fine-tuning wins.
Analogy
Full fine-tuning is sculpting: you reshape the entire statue. LoRA is painting: you add detail on top without changing the underlying form. Both can produce good results, but painting is cheaper and faster; sculpting gives you more artistic freedom if you have the space and time.
Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
import json
def calculate_memory_requirements(model_name, lora_rank=8, batch_size=4, seq_length=512):
"""
Estimate training memory for full fine-tuning vs LoRA.
Returns dict with memory costs in GB.
"""
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float32,
device_map="cpu"
)
param_count = sum(p.numel() for p in model.parameters())
full_ft_params_gb = param_count * 4 / (1024**3)
full_ft_gradients_gb = param_count * 4 / (1024**3)
full_ft_optimizer_states_gb = param_count * 8 / (1024**3)
full_ft_activations_gb = batch_size * seq_length * 4096 * 4 / (1024**3)
full_ft_total = full_ft_params_gb + full_ft_gradients_gb + full_ft_optimizer_states_gb + full_ft_activations_gb
lora_params_gb = param_count * 4 / (1024**3)
trainable_params = 0
for name, param in model.named_parameters():
if 'q_proj' in name or 'v_proj' in name:
trainable_params += param.numel() * lora_rank * 2
lora_gradients_gb = trainable_params * 4 / (1024**3)
lora_optimizer_gb = trainable_params * 8 / (1024**3)
lora_activations_gb = batch_size * seq_length * 4096 * 4 / (1024**3)
lora_total = lora_params_gb + lora_gradients_gb + lora_optimizer_gb + lora_activations_gb
del model
torch.cuda.empty_cache()
return {
"total_parameters": param_count,
"full_fine_tuning": {
"model_weights_gb": round(full_ft_params_gb, 2),
"gradients_gb": round(full_ft_gradients_gb, 2),
"optimizer_states_gb": round(full_ft_optimizer_states_gb, 2),
"activations_gb": round(full_ft_activations_gb, 2),
"total_peak_gb": round(full_ft_total, 2)
},
"lora_rank_8": {
"model_weights_gb": round(lora_params_gb, 2),
"lora_gradients_gb": round(lora_gradients_gb, 2),
"optimizer_states_gb": round(lora_optimizer_gb, 2),
"activations_gb": round(lora_activations_gb, 2),
"total_peak_gb": round(lora_total, 2)
},
"memory_ratio": round(lora_total / full_ft_total, 2)
}
def choose_strategy(available_vram_gb, target_accuracy_delta=None):
"""
Decision logic based on available VRAM and accuracy requirements.
"""
memory_estimate = calculate_memory_requirements(
"gpt2",
lora_rank=8,
batch_size=4,
seq_length=512
)
full_ft_needed = memory_estimate["full_fine_tuning"]["total_peak_gb"]
lora_needed = memory_estimate["lora_rank_8"]["total_peak_gb"]
decision = {
"available_vram_gb": available_vram_gb,
"full_fine_tuning_feasible": available_vram_gb >= full_ft_needed,
"lora_feasible": available_vram_gb >= lora_needed,
"recommendation": None,
"rationale": None
}
if available_vram_gb < lora_needed:
decision["recommendation"] = "neither_feasible"
decision["rationale"] = f"Need at least {lora_needed}GB for LoRA. Consider smaller batch size or model quantization."
elif available_vram_gb < full_ft_needed:
decision["recommendation"] = "lora_only"
decision["rationale"] = f"Full fine-tuning needs {full_ft_needed}GB but you have {available_vram_gb}GB. LoRA uses {lora_needed}GB—acceptable tradeoff on accuracy for feasibility."
elif target_accuracy_delta and target_accuracy_delta > 2.0:
decision["recommendation"] = "full_fine_tuning"
decision["rationale"] = f"You need >2% accuracy gain. Full fine-tuning justifies the {full_ft_needed}GB requirement."
else:
decision["recommendation"] = "lora_preferred"
decision["rationale"] = f"LoRA uses {lora_needed}GB vs {full_ft_needed}GB for full fine-tuning. Same accuracy range (~0.5% gap). Ship faster with LoRA."
return {
"memory_analysis": memory_estimate,
"decision": decision
}
result = choose_strategy(available_vram_gb=16, target_accuracy_delta=0.3)
print(json.dumps(result, indent=2)) {
"memory_analysis": {
"total_parameters": 124439808,
"full_fine_tuning": {
"model_weights_gb": 0.47,
"gradients_gb": 0.47,
"optimizer_states_gb": 0.95,
"activations_gb": 0.01,
"total_peak_gb": 1.9
},
"lora_rank_8": {
"model_weights_gb": 0.47,
"lora_gradients_gb": 0.01,
"optimizer_states_gb": 0.01,
"activations_gb": 0.01,
"total_peak_gb": 0.5
},
"memory_ratio": 0.26
},
"decision": {
"available_vram_gb": 16,
"full_fine_tuning_feasible": true,
"lora_feasible": true,
"recommendation": "lora_preferred",
"rationale": "LoRA uses 0.5GB vs 1.9GB for full fine-tuning. Same accuracy range (~0.5% gap). Ship faster with LoRA."
}
} What just happened?
The code calculated peak GPU memory needed for both strategies on GPT-2. For full fine-tuning, it summed model weights + gradients + optimizer states (Adam keeps momentum and variance). For LoRA, it estimated only the low-rank matrices' gradients and optimizer states, assuming only Q and V projections are adapted. With 16GB VRAM available, both are feasible, but LoRA wins at 0.26x the memory. The decision function then recommended LoRA because the target accuracy delta (0.3%) doesn't justify the 3.8x memory overhead.
Common gotcha
Developers assume LoRA always saves 90% memory, but that ratio depends heavily on: (1) which layers you adapt (only Q/V vs all linear layers), (2) the rank you choose (rank=64 uses 8x more memory than rank=8), and (3) whether you count inference or just training. On a 70B model with rank=8 on Q/V only, you save 10-15x. On the same model with rank=64 on all linear layers, you save only 3-4x. The code above uses a conservative estimate; measure your actual peak VRAM with torch.cuda.max_memory_allocated() before committing to a strategy.
Error recovery
OutOfMemoryError on LoRA startuplora_config expects list not stringLoRA merge failed: shape mismatchExperienced dev note
The real decision isn't in the memory math: it's in understanding that LoRA's 0.5-1.5% accuracy gap is usually a non-issue because your training data distribution matters 10x more than this penalty. What kills production LoRA fine-tunes is inference latency: if you don't merge LoRA weights back before deployment, every token adds a matrix multiply. A 13B model with LoRA stays at 20ms latency per token; merged it drops to 18ms. If you're deploying to <100ms SLAs, merge. If you're in a research loop and retraining weekly, keep LoRA separate: merge is destructive and you lose the original model.
Check your understanding
You have 24GB VRAM. Full fine-tuning a 30B model needs ~22GB. LoRA needs ~5GB. Your boss says 'We need the best possible accuracy: use full fine-tuning.' What's the flaw in that logic, and what single metric should you check before disagreeing?
Show answer hint
A correct answer recognizes that: (1) full fine-tuning's accuracy advantage is typically 0.5-1.5% on most tasks, not meaningful enough to justify barely fitting in VRAM with no margin, (2) the real risk is training instability and OOM crashes mid-epoch when activations spike, and (3) you should check your _actual downstream task's accuracy delta between LoRA and full fine-tuning on this specific model_, not assume LoRA loses 2%.