Code Intermediate medium · 6 min

Why LoRA trains a fraction of parameters but matches full fine-tuning

What you will learn

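LoRA adds trainable low-rank matrices alongside the frozen weights of selected transformer layers, typically cutting trainable parameters by well over 90% (often more than 99%). Quality holds up because the weight changes needed for fine-tuning tend to be low-rank in practice, so constraining the update to a low-rank form gives up little.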

Why this matters

You'll run into memory and compute constraints in production. Understanding why LoRA achieves near-full-tuning quality while training well under a tenth of the parameters helps you make informed trade-offs between speed, memory, and model performance, and know when to push back on architecture choices.

Skip if: LoRA may be the wrong tool when your fine-tuning task needs large, high-rank changes to the model weights (e.g., a severe domain shift), where full fine-tuning can do better. Also skip LoRA if you're doing few-shot in-context learning or prompt engineering: LoRA is for task-specific training, not inference-time tricks.

Explanation

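What it is: LoRA (Low-Rank Adaptation) freezes a pretrained weight matrix W and learns the update to it as the product of two much smaller matrices: W_new = W + (alpha / r) × B × A. For a layer with d_in inputs and d_out outputs, A has shape r × d_in and B has shape d_out × r, where the rank r is far smaller than either dimension. B starts at zero, so training begins from the unmodified pretrained model. This turns a dense weight update into a factorized one.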

How it works mechanically: When you freeze the base model and train only A and B, you're learning a structured perturbation constrained to rank at most r. That is a real restriction, but it costs little in practice: the LoRA authors observed that task-specific weight changes tend to have low intrinsic rank, so a rank of roughly 4 to 8 per layer often captures most of the useful variation (versus 768 to 4096+ dimensions in a typical dense matrix). The frozen base model's knowledge remains intact, and the adapter learns only the delta. This is why quality holds: the adapter isn't trying to learn the whole task from scratch, just the difference between the pretrained and fine-tuned behavior.
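The mechanics can be sketched in a few lines of NumPy. This is a minimal illustration, not PEFT's implementation: the sizes, the random stand-in weights, and the helper name `lora_forward` are all assumptions made for the example. It follows the shape convention PEFT uses (A is r × d_in, B is d_out × r).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not read from a real model).
d_in, d_out, r, alpha = 768, 768, 8, 16
scale = alpha / r

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight (stand-in values)
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init so the update starts at 0


def lora_forward(x, W, A, B, scale):
    """Compute x W^T + scale * (x A^T) B^T without ever modifying W."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))               # a batch of 4 hypothetical activations

# With B at zero, the adapted layer reproduces the frozen layer exactly.
same_at_init = np.allclose(lora_forward(x, W, A, B, scale), x @ W.T)

# After "training" (faked here with random values), the update has rank at most r.
B_trained = rng.normal(size=(d_out, r))
delta_W = scale * B_trained @ A
update_rank = np.linalg.matrix_rank(delta_W)

full_params = W.size                         # parameters in the dense matrix
lora_params = A.size + B.size                # parameters in the low-rank pair
print(same_at_init, update_rank, full_params, lora_params)
```

For this one 768 × 768 layer the pair holds 12,288 trainable values against 589,824 in the dense matrix, which is where the parameter savings come from.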

When to use it: Use LoRA when you're fine-tuning for a specific task (classification, instruction-following, domain adaptation) and training memory, checkpoint size, or serving several task variants matter. It's a common default for production fine-tuning today because it's fast, parameter-efficient, and composable: you can swap different LoRA weights without reloading the base model. It doesn't make inference faster, but an adapter can be merged into the base weights so it adds no inference latency either.

Analogy

LoRA is like keeping a large, detailed map (the base model) and then applying small, focused corrections (the low-rank adapters) in specific regions. You're not redrawing the entire map: you're noting 'this valley is now a lake, this forest is now a city' with minimal ink. The corrections are structured, not scattered randomly.

Code

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Count base parameters first: get_peft_model() injects LoRA layers into `model` in place.
base_params = sum(p.numel() for p in model.parameters())
print(f'Base model parameters: {base_params:,}')

lora_config = LoraConfig(
    r=8,                    # rank of the low-rank update
    lora_alpha=16,          # the update is scaled by lora_alpha / r
    # Names are matched by suffix, so 'c_proj' covers attn.c_proj and mlp.c_proj.
    target_modules=['c_attn', 'c_proj'],
    fan_in_fan_out=True,    # GPT-2 uses Conv1D layers that store weights as (fan_in, fan_out)
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)

model_with_lora = get_peft_model(model, lora_config)
total_params = sum(p.numel() for p in model_with_lora.parameters())
trainable_params = sum(p.numel() for p in model_with_lora.parameters() if p.requires_grad)
print(f'Model with LoRA parameters: {total_params:,}')
print(f'Trainable parameters: {trainable_params:,}')

# Share of all parameters (base + adapter) that will receive gradients.
trainable_fraction = trainable_params / total_params
print(f'Trainable fraction: {trainable_fraction:.2%}')

print('\nLoRA weight shapes:')
for name, param in model_with_lora.named_parameters():
    if 'lora' in name and param.requires_grad:
        print(f'  {name}: {param.shape}')

Output

Base model parameters: 124,439,808
Model with LoRA parameters: 125,250,816
Trainable parameters: 811,008
Trainable fraction: 0.65%

LoRA weight shapes:
  base_model.model.transformer.h.0.attn.c_attn.lora_A.default.weight: torch.Size([8, 768])
  base_model.model.transformer.h.0.attn.c_attn.lora_B.default.weight: torch.Size([2304, 8])
  base_model.model.transformer.h.0.attn.c_proj.lora_A.default.weight: torch.Size([8, 768])
  base_model.model.transformer.h.0.attn.c_proj.lora_B.default.weight: torch.Size([768, 8])
  base_model.model.transformer.h.0.mlp.c_proj.lora_A.default.weight: torch.Size([8, 3072])
  base_model.model.transformer.h.0.mlp.c_proj.lora_B.default.weight: torch.Size([768, 8])
  (the same six entries repeat for transformer.h.1 through transformer.h.11)

What just happened?

The code loaded GPT-2 and applied LoRA to every module whose name ends in `c_attn` or `c_proj`. That covers the fused query/key/value projection and the output projection in each attention block, plus the MLP output projection, which is also named `c_proj`. The base model has about 124M parameters, but only about 811K are trainable (0.65%). Each targeted layer gets one low-rank pair: `lora_A` with shape (r, in_features) and `lora_B` with shape (out_features, r). The pair attaches to the whole projection layer, not to individual attention heads. During training, only these matrices receive gradients. The frozen base model contributes zero trainable parameters but provides all the pretrained knowledge. Exact parameter names can differ slightly between PEFT versions.
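The trainable-parameter count can be checked by hand. The sketch below assumes GPT-2 small's layer shapes and assumes that the `'c_proj'` target matches both `attn.c_proj` and `mlp.c_proj` through suffix matching; the helper `lora_pair_params` is a hypothetical name used only for this illustration.

```python
def lora_pair_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one LoRA pair: A is (r, in_features), B is (out_features, r)."""
    return r * in_features + out_features * r


# GPT-2 small layer shapes as (in_features, out_features); assumed from the architecture.
GPT2_TARGETS = {
    'attn.c_attn': (768, 2304),   # fused query/key/value projection
    'attn.c_proj': (768, 768),    # attention output projection
    'mlp.c_proj': (3072, 768),    # MLP output projection, also matched by 'c_proj'
}
N_LAYERS = 12
BASE_PARAMS = 124_439_808

r = 8
per_layer = sum(lora_pair_params(i, o, r) for i, o in GPT2_TARGETS.values())
trainable = per_layer * N_LAYERS
total = BASE_PARAMS + trainable

print(f'per layer: {per_layer:,}')
print(f'trainable: {trainable:,}')
print(f'fraction:  {trainable / total:.2%}')
```

Under those assumptions each transformer block contributes 67,584 adapter parameters, for 811,008 in total across the 12 blocks.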

Common gotcha

Developers often set r (rank) too high, thinking 'more rank = more capacity = better quality.' That rarely pays off. At r=768 (the full dimension of a 768×768 layer), the two LoRA matrices together hold twice as many parameters as the matrix they adapt, so you've gained nothing except complexity. A rank of 4 to 16 is a reasonable starting range for most tasks, and it is the range commonly reported to work for GPT, LLaMA, and Mistral models. Going higher than 32 often brings little improvement and costs extra memory. Start at r=8, measure quality, lower it if memory is tight, and raise it only if evaluation shows the adapter is underfitting.
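A quick calculation makes the cost of a high rank concrete. This sketch only counts parameters for a single square layer; the helper names are hypothetical, and it says nothing about model quality at each rank.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair of rank r."""
    return r * (d_in + d_out)


def break_even_rank(d_in: int, d_out: int) -> float:
    """Rank at which a LoRA pair has as many parameters as the dense matrix."""
    return (d_in * d_out) / (d_in + d_out)


d_in = d_out = 768
dense = d_in * d_out

for r in (4, 8, 16, 32, 768):
    p = lora_params(d_in, d_out, r)
    print(f'r={r:>3}: {p:>9,} params ({p / dense:.1%} of the dense matrix)')

print(f'break-even rank: {break_even_rank(d_in, d_out):.0f}')
```

For a 768 × 768 layer the break-even rank is 384: above it, the adapter is larger than the matrix it was meant to replace.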

Error recovery

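ValueError: Target modules ... not found in the base model

None of the names in `target_modules` matched a module in your model. A related error reports that a target module is not supported, which means a name matched something that isn't a linear-style layer. Check that `target_modules` names match your model's architecture, and use `model.named_modules()` to list the exact names. Instead of a list, `target_modules` also accepts a single regex string such as `r'.*\.attn\.c_(attn|proj)'`, and recent PEFT versions accept `'all-linear'`. The exact wording of the message depends on your PEFT version.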
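To see why a target name matches more or fewer modules than expected, it helps to emulate the matching rule on a list of names. The sketch below is an approximation of the rule described in PEFT's documentation (a list means exact or suffix match, a string means regex full match), not PEFT's own code. The module names are a hand-written subset of what `model.named_modules()` lists for GPT-2, and `match_targets` is a hypothetical helper.

```python
import re

# Hand-written subset of GPT-2 module names (normally from model.named_modules()).
MODULE_NAMES = [
    'transformer.wte',
    'transformer.h.0.ln_1',
    'transformer.h.0.attn.c_attn',
    'transformer.h.0.attn.c_proj',
    'transformer.h.0.mlp.c_fc',
    'transformer.h.0.mlp.c_proj',
    'lm_head',
]


def match_targets(names, target_modules):
    """Approximate rule: list -> exact or suffix match, str -> regex full match."""
    if isinstance(target_modules, str):
        return [n for n in names if re.fullmatch(target_modules, n)]
    return [
        n for n in names
        if any(n == t or n.endswith('.' + t) for t in target_modules)
    ]


# A bare 'c_proj' also picks up the MLP projection.
print(match_targets(MODULE_NAMES, ['c_attn', 'c_proj']))

# A regex string restricts the match to the attention block.
print(match_targets(MODULE_NAMES, r'.*\.attn\.c_(attn|proj)'))

# LLaMA-style names do not exist in GPT-2, so nothing matches.
print(match_targets(MODULE_NAMES, ['q_proj', 'v_proj']))
```

An empty result like the last one is the situation that produces the "not found in the base model" error.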

RuntimeError: Expected 2D tensor for LoRA weight matrix

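The input reaching a LoRA-wrapped layer has the wrong shape; the exact wording of the error depends on your PyTorch and PEFT versions. This usually happens if you're mixing up batch dimensions or passing the wrong input format. When you call the full model, pass integer token IDs (`input_ids`) of shape `(batch_size, seq_len)` and let the embedding layer add the hidden dimension. Only when you call an attention or projection module directly does the input need shape `(batch_size, seq_len, hidden_dim)`.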

CUDA out of memory even with LoRA

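LoRA shrinks gradient and optimizer-state memory, not activation memory, and the frozen base weights still have to fit on the device. If you're still OOM, reduce batch size or sequence length, or enable gradient checkpointing (`gradient_checkpointing=True` in `TrainingArguments`, or `model.gradient_checkpointing_enable()`). Lowering `r` helps only marginally, because the adapter is already a small share of total memory.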
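A back-of-the-envelope estimate shows which part of the memory budget LoRA actually touches. This sketch assumes fp32 storage and an Adam-style optimizer that keeps two states per trainable parameter. It deliberately ignores activations, which are the same with or without LoRA and are often the part that causes the OOM. The adapter size of 811,008 is the assumed count for r=8 on GPT-2's `c_attn` and `c_proj` layers, and `training_state_bytes` is a hypothetical helper.

```python
BYTES_FP32 = 4


def training_state_bytes(total_params: int, trainable_params: int,
                         optimizer_states: int = 2) -> int:
    """Weights for every parameter, plus gradients and optimizer states
    for the trainable parameters only. Activations are not included."""
    weights = total_params * BYTES_FP32
    grads = trainable_params * BYTES_FP32
    opt = trainable_params * BYTES_FP32 * optimizer_states
    return weights + grads + opt


BASE = 124_439_808   # GPT-2 small
LORA = 811_008       # assumed adapter size (r=8 on c_attn and c_proj)

full_ft = training_state_bytes(BASE, BASE)          # every weight is trainable
lora_ft = training_state_bytes(BASE + LORA, LORA)   # only the adapter is trainable

print(f'full fine-tuning: {full_ft / 2**20:,.0f} MiB')
print(f'LoRA:             {lora_ft / 2**20:,.0f} MiB')
print(f'ratio:            {full_ft / lora_ft:.1f}x')
```

The saving comes almost entirely from gradients and optimizer states, and the weights term barely moves. That is why batch size and gradient checkpointing are the levers to reach for once LoRA is already in place.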

Experienced dev note

The reason LoRA works isn't mathematical magic: it's empirical. Researchers found that the weight change from pretraining to fine-tuning, for a given task, tends to live in a low-rank subspace. The low-rank constraint is still a restriction, but on most tasks it gives up little because it exploits that task-specific structure. This has a production implication: you can keep several LoRA adapters alongside the same base model and switch between them at inference with very little overhead. Want a model that does summarization, translation, and Q&A without reloading? Train three adapters and select the right one per request. Full fine-tuning would need three complete copies of the model, so LoRA changes how you can architect multi-task systems.
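The adapter-switching idea can be illustrated without any model at all. In this NumPy sketch one frozen weight matrix is shared by three adapters, chosen per call. The task names, sizes, and random "trained" values are all hypothetical, and a real deployment would go through PEFT's adapter management rather than a plain dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, scale = 64, 4, 2.0                 # small illustrative sizes

W = rng.normal(size=(d, d))              # shared frozen base weight
W_before = W.copy()


def random_adapter():
    """Stand-in for a trained adapter: (A, B) with shapes (r, d) and (d, r)."""
    return rng.normal(size=(r, d)), rng.normal(size=(d, r))


# Hypothetical task names; each adapter stores only 2 * r * d numbers.
adapters = {name: random_adapter() for name in ('summarize', 'translate', 'qa')}


def forward(x, task):
    """Apply the shared base weight plus the adapter selected for this request."""
    A, B = adapters[task]
    return x @ W.T + scale * (x @ A.T) @ B.T


def merged_weight(task):
    """Fold one adapter into a copy of W so inference needs a single matmul."""
    A, B = adapters[task]
    return W + scale * B @ A


x = rng.normal(size=(2, d))
y_qa = forward(x, 'qa')
y_merged = x @ merged_weight('qa').T

print(np.allclose(y_qa, y_merged))                    # merged and unmerged paths agree
print(np.allclose(y_qa, forward(x, 'translate')))     # different adapters, different outputs
print(np.array_equal(W, W_before))                    # the base weight was never modified
```

Keeping adapters separate makes per-request switching cheap, while merging one into the weights removes the extra matmuls for a single-task deployment.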

Check your understanding

If you set r=2 for a LoRA adapter on the query and value projections of a GPT-2-sized model (12 layers, hidden size 768; GPT-2 itself fuses these projections into `c_attn`), what fraction of the 124M parameters become trainable, and why does the model still learn meaningful task behavior if the low-rank bottleneck is so tight?

Show answer hint

Calculate: For each projection, you add `2 × 768 + 768 × 2 = 3,072` parameters per layer. LoRA attaches to the whole projection, not to each attention head, so the head count doesn't enter the calculation. With two projections and 12 layers, that's 3,072 × 2 × 12 = 73,728 parameters, or about 0.06% of 124M. The model learns because it's not learning the task from scratch: the 124M frozen parameters already encode general language. LoRA only needs to capture the task-specific delta, which tends to be low-rank. Empirically, even `r=4` often matches `r=8` in downstream quality.

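VERSION The code above uses the `LoraConfig` plus `get_peft_model()` pattern, which is PEFT's standard way to create a new adapter for training; `PeftModel.from_pretrained()` is for loading an adapter that has already been saved. LoRA freezes everything outside the adapter matrices, so if you also want to train other modules (a classification head or layer norms, for example), list them in `modules_to_save`. Details of LoRA initialization and target module matching have changed between PEFT releases, so check the release notes for your installed version.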

Now that you understand why LoRA is parameter-efficient, learn how to measure its quality empirically: evaluating a LoRA-fine-tuned model against a full fine-tuned baseline using standardized benchmarks to confirm it truly matches performance.
