When full fine-tuning beats LoRA despite the cost
Why this matters
You'll inherit projects where someone chose LoRA to save memory, but the model quality suffered. Understanding when full fine-tuning is worth the GPU cost prevents shipping inferior models and helps you push back on cargo-cult optimization.
Explanation
Full fine-tuning updates every parameter in the model during training, while LoRA freezes the base model and trains only a pair of low-rank matrices per adapted layer. Mechanically, LoRA is a low-rank approximation of the weight update: instead of updating a weight matrix W of shape d x k (millions of params), you train two much smaller matrices, A (d x r) and B (k x r) with r far smaller than d and k, so that W_adapted = W + AB^T. This shrinks the number of trainable parameters, and with it the gradient and optimizer-state memory, dramatically.
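To make the W + AB^T update concrete, here is a minimal sketch in plain Python. The helper names `lora_param_counts` and `adapted_weight`, and the 4096 x 4096 layer size, are assumptions chosen for illustration; real LoRA implementations also apply a scaling factor to the update, which is left out here.

```python
def lora_param_counts(d, k, r):
    """Trainable values in a full d x k update versus a rank-r LoRA update."""
    full = d * k        # every entry of W is trainable
    lora = r * (d + k)  # A is d x r, B is k x r
    return full, lora


def adapted_weight(W, A, B):
    """Return W + A @ B^T using plain lists (W: d x k, A: d x r, B: k x r)."""
    d, k, r = len(W), len(W[0]), len(A[0])
    return [
        [W[i][j] + sum(A[i][t] * B[j][t] for t in range(r)) for j in range(k)]
        for i in range(d)
    ]


# Hypothetical layer size, chosen only to show the scale of the saving.
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full update: {full:,} values, LoRA update: {lora:,} values")

# Tiny 2 x 3 example with rank 1: the adapted matrix keeps the shape of W.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0], [2.0]]         # d x r = 2 x 1
B = [[0.5], [0.0], [1.0]]  # k x r = 3 x 1
W_adapted = adapted_weight(W, A, B)
print(W_adapted)
```

For the hypothetical 4096 x 4096 layer at rank 8, the full update has 256 times as many trainable values as the LoRA update.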
However, full fine-tuning can reach a lower final loss, converge better on narrow domains, and produce more expressive adaptations because it modifies the entire learned representation. On small-to-medium models (< 3B params), on highly specialized tasks (domain code, legal documents), or when your inference infrastructure can serve a dedicated full copy of the model, full fine-tuning often wins. The cost difference is real, on the order of 5-10x more VRAM and 3-4x slower training depending on model size and precision, but the quality gap can be worth 2-3 percentage points in task accuracy or 0.5 points in perplexity.
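The VRAM gap can be roughed out with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it counts only weights, gradients, and Adam optimizer state, and ignores activations and framework overhead. The function name `training_memory_gb`, the 1B-parameter model, and the 2M adapter parameters are hypothetical.

```python
def training_memory_gb(total_params, trainable_params,
                       weight_bytes=4, train_state_bytes=12):
    """Rough weight-related training memory in GB, ignoring activations.

    weight_bytes: bytes per stored model weight (4 for fp32, 2 for bf16).
    train_state_bytes: extra bytes per trainable parameter. The default of 12
    assumes fp32 Adam: a 4-byte gradient plus two 4-byte moment estimates.
    """
    weights = total_params * weight_bytes
    train_state = trainable_params * train_state_bytes
    return (weights + train_state) / 1e9


TOTAL = 1_000_000_000  # hypothetical 1B-parameter model
ADAPTER = 2_000_000    # hypothetical LoRA adapter size

# Pure fp32 training with Adam.
full_fp32 = training_memory_gb(TOTAL, TOTAL)
lora_fp32 = training_memory_gb(TOTAL, ADAPTER)

# Mixed precision: bf16 weights (2 bytes); per trainable parameter a bf16
# gradient (2) plus an fp32 master copy and two fp32 Adam moments (12).
full_mixed = training_memory_gb(TOTAL, TOTAL, weight_bytes=2, train_state_bytes=14)
lora_mixed = training_memory_gb(TOTAL, ADAPTER, weight_bytes=2, train_state_bytes=14)

print(f"fp32:  full {full_fp32:.2f} GB, LoRA {lora_fp32:.2f} GB")
print(f"mixed: full {full_mixed:.2f} GB, LoRA {lora_mixed:.2f} GB")
```

Under these assumptions the weight-related state differs by roughly 4x in fp32 and roughly 8x in mixed precision. Activation memory is similar for both methods, so measured ratios on a real workload are usually smaller than this estimate suggests.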
The decision depends on three factors: model size (smaller models tolerate full fine-tuning), task specialization (narrow domains benefit more), and your constraint hierarchy (is quality or speed the hard constraint?).
Analogy
LoRA is like tuning a car's suspension with adjustable springs: you change the feel without rebuilding the engine. Full fine-tuning is rebuilding the engine itself: more expensive, slower, but you can completely reshape how it runs if your task demands it.
Code
import gc
import time

import torch
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# TextDataset is deprecated in recent transformers releases; kept here for brevity.
# It expects a plain-text file named train.txt in the working directory.
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=128,
)
# mlm=False selects the causal language-modeling objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

print("\n=== FULL FINE-TUNING ===")
model_full = AutoModelForCausalLM.from_pretrained(model_name)
total_params_full = sum(p.numel() for p in model_full.parameters())
trainable_params_full = sum(p.numel() for p in model_full.parameters() if p.requires_grad)
print(f"Total params: {total_params_full:,}")
print(f"Trainable params: {trainable_params_full:,}")
print(f"Trainable %: {100 * trainable_params_full / total_params_full:.1f}%")

training_args_full = TrainingArguments(
    output_dir="./full_finetuned",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=100,
    save_total_limit=1,
    logging_steps=10,
    max_steps=30,  # overrides num_train_epochs; keeps the demo short
)
trainer_full = Trainer(
    model=model_full,
    args=training_args_full,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
start_full = time.time()
trainer_full.train()
time_full = time.time() - start_full
print(f"Training time: {time_full:.1f}s")
print(f"Peak memory allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# Free the fully fine-tuned model and its optimizer state so they do not
# inflate the LoRA memory measurement, then reset the peak counter.
del trainer_full, model_full
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

print("\n=== LoRA FINE-TUNING ===")
model_lora = AutoModelForCausalLM.from_pretrained(model_name)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    fan_in_fan_out=True,  # GPT-2 Conv1D stores weights as (fan_in, fan_out)
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
# Wrap the base model: base weights are frozen, only the adapters train.
model_lora = get_peft_model(model_lora, lora_config)
total_params_lora = sum(p.numel() for p in model_lora.parameters())
trainable_params_lora = sum(p.numel() for p in model_lora.parameters() if p.requires_grad)
print(f"Total params: {total_params_lora:,}")
print(f"Trainable params: {trainable_params_lora:,}")
print(f"Trainable %: {100 * trainable_params_lora / total_params_lora:.1f}%")

# Same settings as the full run so time and memory are comparable.
training_args_lora = TrainingArguments(
    output_dir="./lora_finetuned",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=100,
    save_total_limit=1,
    logging_steps=10,
    max_steps=30,
)
trainer_lora = Trainer(
    model=model_lora,
    args=training_args_lora,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
start_lora = time.time()
trainer_lora.train()
time_lora = time.time() - start_lora
print(f"Training time: {time_lora:.1f}s")
print(f"Peak memory allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

print("\n=== COMPARISON ===")
print(f"Full fine-tuning trainable params: {trainable_params_full:,}")
print(f"LoRA trainable params: {trainable_params_lora:,}")
print(f"Param reduction: {100 * (1 - trainable_params_lora/trainable_params_full):.1f}%")
print(f"Time overhead of full: {time_full/time_lora:.1f}x")
Example output (timings and memory depend on hardware):
=== FULL FINE-TUNING ===
Total params: 124,439,808
Trainable params: 124,439,808
Trainable %: 100.0%
Training time: 45.3s
Peak memory allocated: 2.14 GB
=== LoRA FINE-TUNING ===
Total params: 124,734,720
Trainable params: 294,912
Trainable %: 0.2%
Training time: 18.7s
Peak memory allocated: 0.89 GB
=== COMPARISON ===
Full fine-tuning trainable params: 124,439,808
LoRA trainable params: 294,912
Param reduction: 99.8%
Time overhead of full: 2.4x
What just happened?
The code trained GPT-2 twice: once with every parameter trainable (full fine-tuning) and once with only about 295K LoRA adapter parameters trainable on top of the 124M frozen base weights (the LoRA total is slightly larger because the adapters are added to the model). In this run, full fine-tuning took about 2.4x the wall-clock time and 2.4x the peak GPU memory. The parameter counts show why: full fine-tuning keeps gradients and optimizer state for all 124M weights, while LoRA keeps them only for the low-rank factors. The run demonstrates the cost, not the payoff, because 30 steps say nothing about final quality. On a narrow-domain task (medical documents, code, legal text) trained to convergence, the fully fine-tuned model can reach a lower validation loss despite this overhead.
Common gotcha
Developers compare full fine-tuning and LoRA training loss after 30 steps and declare LoRA the winner because it trains faster and uses less memory. Then they ship it and find that the validation curves have diverged badly by step 500: LoRA plateaus while full fine-tuning keeps improving. The gotcha is confusing convergence speed with final quality. Each LoRA step is cheaper, but LoRA often needs more steps to reach the same loss floor, and on domain-specific tasks it may never reach that floor unless the base weights themselves change.
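One way to guard against this is to compare validation-loss histories over a longer horizon instead of early training loss. The sketch below is a minimal illustration: the helper `plateau_step`, its `min_delta` and `patience` parameters, and every loss value are made up for this example and are not measurements from any run.

```python
def plateau_step(losses, steps, min_delta=0.01, patience=2):
    """Return the step at which the loss has failed to improve on its best
    value by at least min_delta for `patience` evaluations in a row,
    or None if no such plateau occurs."""
    best = losses[0]
    stale = 0
    for step, loss in zip(steps[1:], losses[1:]):
        if best - loss >= min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return None


# Illustrative validation losses only (hypothetical, not measured).
steps = [100, 200, 300, 400, 500, 600]
full_val = [3.20, 2.95, 2.78, 2.66, 2.57, 2.50]
lora_val = [3.18, 2.99, 2.91, 2.905, 2.903, 2.902]

print("LoRA ahead at first eval:", lora_val[0] < full_val[0])
print("LoRA plateau step:", plateau_step(lora_val, steps))
print("Full plateau step:", plateau_step(full_val, steps))
print(f"Final gap (LoRA - full): {lora_val[-1] - full_val[-1]:.3f}")
```

With these illustrative numbers, LoRA looks slightly better at step 100 but has stopped improving by step 500, while the full fine-tuning curve shows no plateau inside the window. A comparison that stops at the first evaluation would pick the wrong model.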
Error recovery
RuntimeError: CUDA out of memory
ValueError: target_modules not found
Experienced dev note
The real decision isn't about the cost of compute: it's about the cost of being wrong. Full fine-tuning costs you about 2.4x in training time and peak memory in the run above. LoRA can cost you 1-3 percentage points of task accuracy on narrow domains and locks you into a fixed base model. In production, if a 2% accuracy drop is worth $500/month in cheaper training, choose LoRA. If a 2% drop costs you $5M in customer churn or regulatory issues (legal AI, medical AI), full fine-tune without hesitation. Most teams optimize the wrong variable: they minimize training cost when they should minimize deployment risk.
Check your understanding
You're fine-tuning a 2B-parameter model on internal company code for autocomplete. Full fine-tuning takes 6 hours on an A100, LoRA takes 1.5 hours. Your validation loss after 100 steps is identical for both. Which should you choose for production, and why would the choice differ if your test set were general code from GitHub instead of internal code?
Answer hint
A correct answer recognizes that identical loss at early steps is misleading. The key insight is that LoRA can hit a quality ceiling on narrow, specialized distributions (internal code is highly specialized) because rank-8 updates may not be able to reshape the model's representation enough to absorb knowledge the base model has never seen. On general distributions (GitHub code), the ceiling matters far less because the base model already knows that distribution well. The answer should also mention that 6 hours on an A100 is an acceptable price for production code quality, whereas the 1.5-hour LoRA run might ship a model that fails on internal idioms.