Code Advanced hard · 8 min

Full fine-tuning vs LoRA decision

What you will learn

Choose between full fine-tuning and LoRA by calculating memory cost, training time, and inference latency trade-offs for your specific model and hardware.

Why this matters

This decision directly impacts your budget, deployment speed, and whether you can even run training on your hardware. A wrong choice wastes weeks of iteration or runs out of VRAM mid-training. Senior teams make this call upfront based on data, not guessing.

Skip if: (1) you have an unlimited compute budget and want maximum accuracy (use full fine-tuning), (2) you're fine-tuning a model smaller than 1B parameters, where memory isn't the constraint, or (3) you're only running inference on a model you already have.

Explanation

Full fine-tuning updates every parameter in the model during training. LoRA (Low-Rank Adaptation) freezes the original model and trains only small rank-decomposition matrices, reducing trainable parameters from billions to millions. Mechanically, LoRA injects trainable A and B matrices into each adapted layer: the layer output becomes h = W₀x + (B·A)x, where W₀ is the frozen d_out × d_in weight, A is r × d_in, B is d_out × r, and the rank r is much smaller than either dimension. Full fine-tuning requires storing gradients and optimizer states for every parameter; LoRA stores them only for A and B.

The decision hinges on three axes: memory available (LoRA is often quoted as using 5-10x less VRAM, though the real ratio depends on weight precision, rank, and which layers you adapt), final accuracy needed (full fine-tuning is usually cited as 1-3% better on downstream tasks, and the gap is task-dependent and often smaller), and inference speed (LoRA adds latency if you don't merge the weights back). For most practitioners below 80B parameters with limited GPU memory, LoRA wins. For specialized domains where every 0.5% accuracy matters and you have 8+ A100s, full fine-tuning wins.
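The formula h = W₀x + (B·A)x can be made concrete with a minimal pure-Python sketch. It is not how peft implements the layer: the tiny dimensions, the helper names, and the omission of LoRA's scaling factor are all assumptions made for illustration. It uses the usual convention of initializing B to zero, so the adapted layer starts out identical to the frozen one.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x):
    """Compute h = W0 x + B (A x). W0 stays frozen; only A and B would train."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))  # two thin products, B·A is never formed
    return [h0 + dh for h0, dh in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full update vs LoRA."""
    return {"full": d_out * d_in, "lora": rank * (d_in + d_out)}


random.seed(0)
d_out, d_in, rank = 4, 6, 2  # toy sizes, chosen only for illustration
w0 = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]  # r x d_in
b = [[0.0] * rank for _ in range(d_out)]  # d_out x r, zero-initialized
x = [random.uniform(-1, 1) for _ in range(d_in)]

# With B = 0 the adapter contributes nothing, so training starts from the base model.
h_start = lora_forward(w0, a, b, x)
base_only = matvec(w0, x)

counts = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(counts)  # {'full': 16777216, 'lora': 65536}
```

For a single 4096 × 4096 projection at rank 8, the trainable count drops by a factor of 256, which is where the "billions to millions" reduction comes from.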

Analogy

Full fine-tuning is sculpting: you reshape the entire statue. LoRA is painting: you add detail on top without changing the underlying form. Both can produce good results, but painting is cheaper and faster; sculpting gives you more artistic freedom if you have the space and time.

Code

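Illustrative only: running it requires the imported libraries to be installed and downloads the GPT-2 weights from the Hugging Face Hub. No API key is needed.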

python

import json

import torch
from transformers import AutoModelForCausalLM

GIB = 1024**3
BYTES_FP32 = 4


def calculate_memory_requirements(model_name, lora_rank=8, batch_size=4, seq_length=512,
                                  target_modules=("c_attn",)):
    """
    Rough estimate of training memory for full fine-tuning vs LoRA.
    Assumes fp32 weights and Adam. Returns a dict with memory costs in GiB.

    target_modules defaults to GPT-2's fused attention projection ("c_attn");
    Llama-style models name the equivalents "q_proj" and "v_proj".
    """
    # Loaded on CPU only to count parameters; the estimate itself needs no GPU.
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

    param_count = sum(p.numel() for p in model.parameters())

    # Activation floor: a single fp32 hidden-state tensor. Real activation memory
    # grows with layer count and attention buffers, so treat this as a lower bound.
    activations_gb = batch_size * seq_length * model.config.hidden_size * BYTES_FP32 / GIB

    # Full fine-tuning: weights + gradients + Adam momentum and variance.
    full_ft_params_gb = param_count * BYTES_FP32 / GIB
    full_ft_gradients_gb = param_count * BYTES_FP32 / GIB
    full_ft_optimizer_states_gb = param_count * 2 * BYTES_FP32 / GIB
    full_ft_total = (full_ft_params_gb + full_ft_gradients_gb
                     + full_ft_optimizer_states_gb + activations_gb)

    # LoRA: a (d_in, d_out) weight gains rank * (d_in + d_out) trainable parameters.
    trainable_params = 0
    for name, param in model.named_parameters():
        if param.dim() == 2 and any(target in name for target in target_modules):
            trainable_params += lora_rank * sum(param.shape)

    # Frozen base weights stay resident, plus the small A and B matrices.
    lora_params_gb = (param_count + trainable_params) * BYTES_FP32 / GIB
    lora_gradients_gb = trainable_params * BYTES_FP32 / GIB
    lora_optimizer_gb = trainable_params * 2 * BYTES_FP32 / GIB
    lora_total = lora_params_gb + lora_gradients_gb + lora_optimizer_gb + activations_gb

    del model

    return {
        "total_parameters": param_count,
        "full_fine_tuning": {
            "model_weights_gb": round(full_ft_params_gb, 2),
            "gradients_gb": round(full_ft_gradients_gb, 2),
            "optimizer_states_gb": round(full_ft_optimizer_states_gb, 2),
            "activations_gb": round(activations_gb, 2),
            "total_peak_gb": round(full_ft_total, 2)
        },
        "lora": {
            "rank": lora_rank,
            "trainable_parameters": trainable_params,
            "model_weights_gb": round(lora_params_gb, 2),
            "lora_gradients_gb": round(lora_gradients_gb, 2),
            "optimizer_states_gb": round(lora_optimizer_gb, 2),
            "activations_gb": round(activations_gb, 2),
            "total_peak_gb": round(lora_total, 2)
        },
        "memory_ratio": round(lora_total / full_ft_total, 2)
    }


def choose_strategy(available_vram_gb, target_accuracy_delta=None, model_name="gpt2"):
    """
    Decision logic based on available VRAM and accuracy requirements.
    target_accuracy_delta is the accuracy gain (in percentage points) you need.
    """
    memory_estimate = calculate_memory_requirements(
        model_name,
        lora_rank=8,
        batch_size=4,
        seq_length=512
    )

    full_ft_needed = memory_estimate["full_fine_tuning"]["total_peak_gb"]
    lora_needed = memory_estimate["lora"]["total_peak_gb"]

    decision = {
        "available_vram_gb": available_vram_gb,
        "full_fine_tuning_feasible": available_vram_gb >= full_ft_needed,
        "lora_feasible": available_vram_gb >= lora_needed,
        "recommendation": None,
        "rationale": None
    }

    if available_vram_gb < lora_needed:
        decision["recommendation"] = "neither_feasible"
        decision["rationale"] = f"Need at least {lora_needed}GB for LoRA. Consider smaller batch size or model quantization."
    elif available_vram_gb < full_ft_needed:
        decision["recommendation"] = "lora_only"
        decision["rationale"] = f"Full fine-tuning needs {full_ft_needed}GB but you have {available_vram_gb}GB. LoRA uses {lora_needed}GB, an acceptable accuracy tradeoff for feasibility."
    elif target_accuracy_delta is not None and target_accuracy_delta > 2.0:
        decision["recommendation"] = "full_fine_tuning"
        decision["rationale"] = f"You need >2% accuracy gain. Full fine-tuning justifies the {full_ft_needed}GB requirement."
    else:
        decision["recommendation"] = "lora_preferred"
        decision["rationale"] = f"LoRA uses {lora_needed}GB vs {full_ft_needed}GB for full fine-tuning. The accuracy gap is usually small, so ship faster with LoRA."

    return {
        "memory_analysis": memory_estimate,
        "decision": decision
    }


result = choose_strategy(available_vram_gb=16, target_accuracy_delta=0.3)
print(json.dumps(result, indent=2))

Output

{
  "memory_analysis": {
    "total_parameters": 124439808,
    "full_fine_tuning": {
      "model_weights_gb": 0.46,
      "gradients_gb": 0.46,
      "optimizer_states_gb": 0.93,
      "activations_gb": 0.01,
      "total_peak_gb": 1.86
    },
    "lora": {
      "rank": 8,
      "trainable_parameters": 294912,
      "model_weights_gb": 0.46,
      "lora_gradients_gb": 0.0,
      "optimizer_states_gb": 0.0,
      "activations_gb": 0.01,
      "total_peak_gb": 0.47
    },
    "memory_ratio": 0.25
  },
  "decision": {
    "available_vram_gb": 16,
    "full_fine_tuning_feasible": true,
    "lora_feasible": true,
    "recommendation": "lora_preferred",
    "rationale": "LoRA uses 0.47GB vs 1.86GB for full fine-tuning. The accuracy gap is usually small, so ship faster with LoRA."
  }
}

What just happened?

The code estimated peak GPU memory for both strategies on GPT-2. For full fine-tuning, it summed model weights, gradients, and optimizer states (Adam keeps a momentum and a variance value per parameter), plus a small activation term. For LoRA, it kept the frozen weights and added gradients and optimizer states only for the low-rank matrices, assuming only the attention projections are adapted. With 16GB of VRAM available, both are feasible, but LoRA needs roughly a quarter of the memory. The decision function then recommended LoRA because the target accuracy delta (0.3%) is below the 2% threshold that would justify the roughly 4x memory overhead of full fine-tuning.

Common gotcha

Developers assume LoRA always saves 90% memory, but that ratio depends heavily on: (1) which layers you adapt (only Q/V vs all linear layers), (2) the rank you choose (rank=64 needs 8x the adapter parameters, gradients, and optimizer states of rank=8), and (3) whether you count inference or just training. On a 70B model with rank=8 on Q/V only, savings of 10-15x are possible, but figures that high typically assume the frozen base weights are stored in reduced precision. On the same model with rank=64 on all linear layers, the saving can shrink to 3-4x. The code above is a rough estimate that undercounts activations; measure your actual peak VRAM with torch.cuda.max_memory_allocated() before committing to a strategy.
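To see how rank and layer choice move the ratio, here is a small arithmetic sketch. The layer shapes (hidden size 4096, MLP width 11008, 32 layers) and the round 7B parameter count are assumptions standing in for a 7B-class decoder, not measurements. It models only static memory with fp32 weights and Adam, and excludes activations.

```python
def lora_adapter_params(layer_shapes, rank, num_layers):
    """Sum rank * (d_in + d_out) over every adapted matrix in every layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers


# Assumed shapes for a 7B-class decoder; adjust to your model's config.
HIDDEN, MLP, LAYERS = 4096, 11008, 32
QV_ONLY = [(HIDDEN, HIDDEN)] * 2
ALL_LINEAR = [(HIDDEN, HIDDEN)] * 4 + [(HIDDEN, MLP)] * 2 + [(MLP, HIDDEN)]

BASE_PARAMS = 7_000_000_000  # assumed round number, not a measured count
BYTES_TRAINED = 16  # fp32 weight + gradient + two Adam states per parameter
BYTES_FROZEN = 4    # fp32 weight only


def static_saving(adapter_params):
    """Full fine-tuning bytes divided by LoRA bytes, activations excluded."""
    full = BASE_PARAMS * BYTES_TRAINED
    lora = BASE_PARAMS * BYTES_FROZEN + adapter_params * BYTES_TRAINED
    return full / lora


small = lora_adapter_params(QV_ONLY, rank=8, num_layers=LAYERS)
large = lora_adapter_params(ALL_LINEAR, rank=64, num_layers=LAYERS)
print(small, large)  # 4194304 159907840
print(round(static_saving(small), 2), round(static_saving(large), 2))
```

Under these fp32 assumptions the adapter grows about 38x between the two configurations, yet the static saving stays just under 4x in both cases because the frozen base weights dominate. Ratios larger than that require storing the frozen base in lower precision, which this sketch does not model.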

Error recovery

OutOfMemoryError on LoRA startup

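You've set lora_rank too high or adapted too many layers. Start with rank=4 on only q_proj and v_proj by passing target_modules=['q_proj', 'v_proj'] to LoraConfig. Do not add 'o_proj' or 'up_proj' without benchmarking first. If the error appears while the base model is still loading, the frozen weights themselves do not fit and no LoRA setting will help; load the base model in lower precision or quantize it.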

lora_config expects list not string

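Use target_modules=['q_proj', 'v_proj'], not target_modules='q_proj,v_proj'. peft treats a single string as a regular expression to match against module names, so the comma-separated form matches none of the layers you intended.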
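One way to guard against this mistake is to normalize the value before building the config. The helper below is hypothetical and not part of peft; it assumes you never intend the string to be used as a regular expression.

```python
def normalize_target_modules(target_modules):
    """Return a list of module names from a list or a comma-separated string.

    Hypothetical helper for illustration; peft does not provide it.
    """
    if isinstance(target_modules, str):
        names = [name.strip() for name in target_modules.split(",")]
    else:
        names = [str(name).strip() for name in target_modules]
    names = [name for name in names if name]  # drop empty entries
    if not names:
        raise ValueError("target_modules is empty; no layers would be adapted")
    return names


modules = normalize_target_modules("q_proj, v_proj")
print(modules)  # ['q_proj', 'v_proj']

# The list can then be passed to peft, for example:
# from peft import LoraConfig
# lora_config = LoraConfig(r=4, target_modules=modules, task_type="CAUSAL_LM")
```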

LoRA merge failed: shape mismatch

You merged LoRA weights, then tried to use the model with peft still attached. Call model = model.merge_and_unload() once, then don't wrap it in get_peft_model() again.

Experienced dev note

The real decision isn't in the memory math. It's in understanding that LoRA's 0.5-1.5% accuracy gap is usually a non-issue, because your training data distribution typically matters far more than this penalty. What hurts production LoRA fine-tunes is inference latency: if you don't merge the LoRA weights back before deployment, every adapted layer performs extra low-rank matrix multiplies for each token. As an illustrative figure, a 13B model might run at 20ms per token with LoRA attached and 18ms once merged. If you're deploying to <100ms SLAs, merge. If you're in a research loop and retraining weekly, keep LoRA separate: merging rewrites the base weights in place, so keep the original checkpoint and the adapter files if you need to go back.
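Why merging removes the extra work can be shown with a toy example: folding B·A into W₀ once produces a single matrix whose output matches the unmerged two-path computation. This is a pure-Python sketch with made-up 2 × 3 weights, not peft's merge_and_unload(). The helper returns a new matrix so the original W₀ is preserved.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two row-major matrices."""
    right_cols = list(zip(*right))
    return [[sum(l * r for l, r in zip(row, col)) for col in right_cols] for row in left]


def merge_lora(w0, a, b):
    """Return W0 + B·A as a new matrix; w0 itself is left untouched."""
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]


w0 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen weight, d_out=2, d_in=3
a = [[0.5, -1.0, 2.0]]                     # rank 1 x d_in
b = [[2.0], [-1.0]]                        # d_out x rank 1
x = [1.0, 1.0, 1.0]

# Unmerged: base path plus adapter path, three matrix-vector products per call.
unmerged = [h + dh for h, dh in zip(matvec(w0, x), matvec(b, matvec(a, x)))]

# Merged: one matrix-vector product per call after a one-time merge.
merged_weight = merge_lora(w0, a, b)
merged = matvec(merged_weight, x)

print(unmerged, merged)  # [6.0, -1.5] [6.0, -1.5]
```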

Check your understanding

You have 24GB VRAM. Full fine-tuning a 1.3B model (fp32 weights, Adam) needs ~22GB. LoRA needs ~5GB. Your boss says 'We need the best possible accuracy: use full fine-tuning.' What's the flaw in that logic, and what single metric should you check before disagreeing?

Answer hint

A correct answer recognizes that: (1) full fine-tuning's accuracy advantage is typically 0.5-1.5% on most tasks, not meaningful enough to justify barely fitting in VRAM with no margin, (2) the real risk is training instability and OOM crashes mid-epoch when activations spike, and (3) you should check the actual accuracy delta between LoRA and full fine-tuning on your downstream task with this specific model, not assume LoRA loses 2%.

VERSION: peft and trl change their APIs between releases, so check the release notes for the versions you have installed. Prefer get_peft_model(model, lora_config) over constructing PeftModel(...) directly, and pass training options to SFTTrainer through an SFTConfig supplied as the args= parameter, not as individual arguments.

Once you've chosen your strategy, you need to understand how to configure LoRA's target_modules and rank hyperparameters, and why the defaults often underperform. Both are covered in LoRA configuration patterns.
