Code Intermediate medium · 7 min

When full fine-tuning beats LoRA despite the cost

What you will learn

Full fine-tuning outperforms LoRA on certain architectures and task distributions, but only when you measure what actually matters for your model.

Why this matters

You'll inherit projects where someone chose LoRA to save memory, but the model quality suffered. Understanding when full fine-tuning is worth the GPU cost prevents shipping inferior models and helps you push back on cargo-cult optimization.

Skip if: You're fine-tuning 7B+ models on consumer GPUs, or serving multiple task-specific adapters on the same base model in production. LoRA is the right choice under those constraints. If inference latency is the hard constraint, note that unmerged adapters add overhead at serving time, so merge the adapter into the base weights or use full fine-tuning.

Explanation

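Full fine-tuning updates every parameter in the model during training, while LoRA freezes the base model and trains only low-rank decomposition matrices. Mechanically, LoRA is a low-rank approximation of the weight update: instead of updating a d x k weight matrix W (often millions of params), you train two much smaller matrices A (d x r) and B (k x r), with r far smaller than d and k, so that W_adapted = W + AB^T. This shrinks the number of trainable parameters dramatically, and with it the gradient and optimizer-state memory. The forward pass still runs through the full frozen model, so the savings in raw computation are smaller than the savings in memory.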
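To make the parameter saving concrete, here is a minimal NumPy sketch of the update W_adapted = W + AB^T for a single weight matrix. The shapes (768 x 2304, the size of GPT-2's c_attn projection) and the rank of 8 are illustrative assumptions, and initializing B to zero follows the common LoRA convention rather than anything this lesson prescribes.

```python
import numpy as np

d_in, d_out, r = 768, 2304, 8   # assumed shapes: GPT-2 c_attn projection, rank 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out))      # frozen base weight
A = rng.standard_normal((d_in, r)) * 0.01   # trainable factor, small random init
B = np.zeros((d_out, r))                    # trainable factor, zero init so the update starts at 0

W_adapted = W + A @ B.T                     # low-rank update, same shape as W

full_params = W.size                        # values updated by full fine-tuning
lora_params = A.size + B.size               # values updated by LoRA
print(full_params, lora_params, full_params // lora_params)
```

For this one matrix, LoRA trains 24,576 values instead of 1,769,472, a 72x reduction. Because B starts at zero, W_adapted equals W before any training step.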

However, full fine-tuning can achieve lower final loss, better convergence on narrow domains, and more expressive adaptations because it modifies the entire learned representation. On small-to-medium models (< 3B params), on highly specialized tasks (domain code, legal documents), or when your infrastructure can handle it, full fine-tuning often wins. The cost difference is real: up to 5-10x more VRAM and 3-4x slower training, with the exact ratio depending on model size, precision, and optimizer. The quality gap, though, can be worth 2-3 percentage points in task accuracy or 0.5 points in perplexity.
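The VRAM gap can be estimated with back-of-the-envelope arithmetic. The sketch below counts only weights, gradients, and Adam's two moment buffers, and it assumes fp32 storage (4 bytes per value) with activations ignored. The helper name training_state_bytes is hypothetical, and the LoRA parameter count assumes rank-8 adapters on the 768 x 2304 c_attn projection in each of GPT-2 small's 12 blocks.

```python
def training_state_bytes(total_params, trainable_params, bytes_per_value=4):
    """Rough bytes for weights + gradients + Adam moments (activations excluded)."""
    weights = total_params * bytes_per_value
    grads = trainable_params * bytes_per_value
    adam_moments = 2 * trainable_params * bytes_per_value
    return weights + grads + adam_moments

gpt2_params = 124_439_808                 # GPT-2 small
lora_params = 12 * 8 * (768 + 2304)       # 12 blocks, rank 8, A and B factors per block

full = training_state_bytes(gpt2_params, gpt2_params)
lora = training_state_bytes(gpt2_params + lora_params, lora_params)
print(f"full: {full / 1e9:.2f} GB, LoRA: {lora / 1e9:.2f} GB, ratio: {full / lora:.1f}x")
```

Under these assumptions the static training state differs by roughly 4x. Activation memory is similar for both methods, so the measured peak-memory ratio on a real run is usually smaller than this estimate.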

The decision depends on three factors: model size (smaller models tolerate full fine-tuning), task specialization (narrow domains benefit more), and your constraint hierarchy (is quality or speed the hard constraint?).

Analogy

LoRA is like tuning a car's suspension with adjustable springs: you change the feel without rebuilding the engine. Full fine-tuning is rebuilding the engine itself: more expensive, slower, but you can completely reshape how it runs if your task demands it.

Code

Illustrative only - requires a CUDA GPU, a local train.txt file, and the torch, transformers, and peft packages

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
import time

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Note: TextDataset is deprecated in recent transformers releases in favor of the datasets library
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=128
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

print("\n=== FULL FINE-TUNING ===")
model_full = AutoModelForCausalLM.from_pretrained(model_name)
total_params_full = sum(p.numel() for p in model_full.parameters())
trainable_params_full = sum(p.numel() for p in model_full.parameters() if p.requires_grad)
print(f"Total params: {total_params_full:,}")
print(f"Trainable params: {trainable_params_full:,}")
print(f"Trainable %: {100 * trainable_params_full / total_params_full:.1f}%")

training_args_full = TrainingArguments(
    output_dir="./full_finetuned",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=100,
    save_total_limit=1,
    logging_steps=10,
    max_steps=30
)

trainer_full = Trainer(
    model=model_full,
    args=training_args_full,
    data_collator=data_collator,
    train_dataset=train_dataset
)

start_full = time.time()
trainer_full.train()
time_full = time.time() - start_full
print(f"Training time: {time_full:.1f}s")
print(f"Peak memory allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

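# Release the full model and its optimizer state so the LoRA peak is measured from a clean baseline
import gc
del trainer_full, model_full
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()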

print("\n=== LoRA FINE-TUNING ===")
model_lora = AutoModelForCausalLM.from_pretrained(model_name)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model_lora = get_peft_model(model_lora, lora_config)

total_params_lora = sum(p.numel() for p in model_lora.parameters())
trainable_params_lora = sum(p.numel() for p in model_lora.parameters() if p.requires_grad)
print(f"Total params: {total_params_lora:,}")
print(f"Trainable params: {trainable_params_lora:,}")
print(f"Trainable %: {100 * trainable_params_lora / total_params_lora:.1f}%")

training_args_lora = TrainingArguments(
    output_dir="./lora_finetuned",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=100,
    save_total_limit=1,
    logging_steps=10,
    max_steps=30
)

trainer_lora = Trainer(
    model=model_lora,
    args=training_args_lora,
    data_collator=data_collator,
    train_dataset=train_dataset
)

start_lora = time.time()
trainer_lora.train()
time_lora = time.time() - start_lora
print(f"Training time: {time_lora:.1f}s")
print(f"Peak memory allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

print("\n=== COMPARISON ===")
print(f"Full fine-tuning trainable params: {trainable_params_full:,}")
print(f"LoRA trainable params: {trainable_params_lora:,}")
print(f"Param reduction: {100 * (1 - trainable_params_lora/trainable_params_full):.1f}%")
print(f"Time overhead of full: {time_full/time_lora:.1f}x")

Output

=== FULL FINE-TUNING ===
Total params: 124,439,808
Trainable params: 124,439,808
Trainable %: 100.0%
Training time: 45.3s
Peak memory allocated: 2.14 GB

=== LoRA FINE-TUNING ===
Total params: 124,734,720
Trainable params: 294,912
Trainable %: 0.2%
Training time: 18.7s
Peak memory allocated: 0.89 GB

=== COMPARISON ===
Full fine-tuning trainable params: 124,439,808
LoRA trainable params: 294,912
Param reduction: 99.8%
Time overhead of full: 2.4x

What just happened?

The code trained GPT-2 twice: once with every parameter trainable (full fine-tuning) and once with only about 295K LoRA adapter parameters trainable on top of the 124M frozen base weights. The adapter count follows from the config: each of the 12 blocks adds rank-8 factors for the 768 x 2304 c_attn projection, or 8 x (768 + 2304) = 24,576 parameters per block. In this run, full fine-tuning took 2.4x the wall-clock time and 2.4x the peak GPU memory; the timings and memory figures are illustrative and will vary with hardware. This demonstrates the cost, not the benefit, because the script never measures validation loss. On a narrower domain task (medical documents, code, legal text), the full fine-tuned model may converge to lower validation loss despite this overhead, but you need held-out data to confirm it.

Common gotcha

Developers compare full fine-tuning and LoRA training loss after 30 steps and declare LoRA the winner because it trains faster and uses less memory. Then they ship it and find that the validation losses have diverged badly by step 500, with LoRA plateauing while full fine-tuning keeps improving. The gotcha is confusing convergence speed with final quality. LoRA steps are cheaper, but LoRA often needs more of them to reach the same loss floor, and on domain-specific tasks it may never reach that floor unless the base model parameters themselves change.
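One way to guard against this is to refuse to pick a winner until both runs have been evaluated far enough along. The sketch below is a hypothetical helper, not part of any library: compare_runs, the 500-step threshold, and the loss values are all made up for illustration, and it assumes each run's validation losses are logged as (step, loss) pairs in increasing step order.

```python
def compare_runs(full_evals, lora_evals, min_steps=500):
    """Each argument is a list of (step, validation_loss) pairs from one run."""
    last_common_step = min(full_evals[-1][0], lora_evals[-1][0])
    if last_common_step < min_steps:
        return f"inconclusive: only {last_common_step} shared steps, need {min_steps}"
    # Compare the best validation loss each run reached within the shared range
    best_full = min(loss for step, loss in full_evals if step <= last_common_step)
    best_lora = min(loss for step, loss in lora_evals if step <= last_common_step)
    winner = "full" if best_full < best_lora else "lora"
    return f"{winner} (full={best_full:.3f}, lora={best_lora:.3f} at <= {last_common_step} steps)"

# Hypothetical numbers, for illustration only
early = compare_runs([(30, 3.10)], [(30, 3.05)])
late = compare_runs([(30, 3.10), (500, 2.40)], [(30, 3.05), (500, 2.65)])
print(early)
print(late)
```

With the made-up numbers above, the 30-step comparison is reported as inconclusive even though LoRA is slightly ahead, and the 500-step comparison favors full fine-tuning.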

Error recovery

RuntimeError: CUDA out of memory

You're running full fine-tuning on too small a GPU. Either reduce per_device_train_batch_size, use gradient_accumulation_steps to keep the same effective batch size with less memory, or switch to LoRA. Full fine-tuning on 8GB GPUs works best with models < 1B params and a per-device batch size of 1.
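As a rough sketch of the first two options, the settings below trade per-step memory for more micro-batches per optimizer step. The dictionary is meant to be unpacked into transformers.TrainingArguments; the specific values are assumptions chosen to reproduce the batch size of 4 used in the code above, and gradient_checkpointing is an extra memory-saving option the lesson does not otherwise cover.

```python
# Keyword arguments intended for transformers.TrainingArguments(**oom_safe_kwargs)
oom_safe_kwargs = dict(
    output_dir="./full_finetuned",
    per_device_train_batch_size=1,    # smallest per-step memory footprint
    gradient_accumulation_steps=4,    # accumulate 4 micro-batches per optimizer step
    gradient_checkpointing=True,      # recompute activations to save memory (slower)
    max_steps=30,
)

# On a single GPU the optimizer still sees the same number of examples per update
effective_batch_size = (
    oom_safe_kwargs["per_device_train_batch_size"]
    * oom_safe_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)
```

Accumulation keeps the effective batch size at 4 while only one example's activations are held in memory at a time, at the cost of more forward and backward passes per update.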

ValueError: target_modules not found

The LoRA config targets modules that don't exist in your model. Use model.named_modules() to list the actual module names (named_parameters() returns weight and bias names, not modules), then set target_modules=["the_actual_name"]. For GPT-2, use "c_attn"; for Llama, use "q_proj" and "v_proj".
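A small helper can turn the module listing into a shortlist of candidate names. The sketch below is hypothetical: candidate_target_modules is not a peft function, and the stand-in classes exist only so the example runs without downloading a model. It assumes you want layers whose class name is Linear or Conv1D (GPT-2 uses Conv1D for its projections).

```python
def candidate_target_modules(named_modules, layer_types=("Linear", "Conv1D")):
    """Return the distinct leaf names of modules whose class name is in layer_types."""
    names = set()
    for full_name, module in named_modules:
        if type(module).__name__ in layer_types:
            names.add(full_name.rsplit(".", 1)[-1])   # keep only the last path component
    return sorted(names)

# Stand-ins for real layers so the sketch runs without loading a model
class Conv1D:
    pass

class LayerNorm:
    pass

fake_gpt2 = [
    ("transformer.h.0.ln_1", LayerNorm()),
    ("transformer.h.0.attn.c_attn", Conv1D()),
    ("transformer.h.0.attn.c_proj", Conv1D()),
    ("transformer.h.0.mlp.c_fc", Conv1D()),
]
candidates = candidate_target_modules(fake_gpt2)
print(candidates)
# With a real model: candidate_target_modules(model.named_modules())
```

Any name in the returned list is a valid value for target_modules, since peft matches target names against the ends of module paths.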

Experienced dev note

The real decision isn't about the cost of compute; it's about the cost of being wrong. In the GPT-2 run above, full fine-tuning costs you 2.4x the training time and 2.4x the memory. LoRA can cost you 1-3 percentage points of task accuracy on narrow domains, and it locks you into a fixed base model. In production, if a 2% accuracy drop is worth $500/month in cheaper training, choose LoRA. If a 2% drop costs you $5M in customer churn or regulatory issues (legal AI, medical AI), full fine-tune without hesitation. Most teams optimize the wrong variable: they minimize training cost when they should minimize deployment risk.

Check your understanding

You're fine-tuning a 2B-parameter model on internal company code for autocomplete. Full fine-tuning takes 6 hours on an A100, LoRA takes 1.5 hours. Your validation loss after 100 steps is identical for both. Which should you choose for production, and why would the choice differ if your test set was general code from GitHub instead of internal code?

Show answer hint

A correct answer recognizes that matching loss at early steps is misleading. The key insight is that LoRA can hit a quality ceiling on narrow, specialized distributions (internal code is highly specialized) because rank-8 updates may not be expressive enough to reshape the model's representation for knowledge the base model never saw. On general distributions (GitHub code), the ceiling matters less because the base model already knows that distribution well. The answer should also mention that 6 hours on an A100 is an acceptable price for production code quality, whereas the 1.5-hour LoRA run might ship a model that fails on internal idioms.

VERSION trl's SFTTrainer accepts a peft_config argument and wraps the model in LoRA adapters for you, so you don't call get_peft_model yourself. With the plain transformers Trainer used in the code above, you wrap the model with get_peft_model before passing it to Trainer. The exact version boundaries shift between releases, so check the release notes for the transformers, trl, and peft versions you have installed. Full fine-tuning needs no adapter library, so it is less sensitive to version changes.

Now that you understand the full vs. LoRA tradeoff, learn how to measure whether your fine-tune actually improved generalization: evaluating fine-tuned models with held-out domain data and baseline comparisons.
