Why LoRA trains a fraction of parameters but matches full fine-tuning
Why this matters
You'll run into memory and compute constraints in production. Understanding why LoRA achieves near-full-tuning quality with 1/10th the parameters helps you make informed trade-offs between speed, memory, and model performance: and know when to push back on architecture choices.
Explanation
What it is: LoRA (Low-Rank Adaptation) replaces the gradient-descent update to a weight matrix W with the product of two smaller matrices: W_new = W + A × B, where A and B are learned low-rank decompositions. This turns a dense weight update into a factorized one.
How it works mechanically: When you freeze the base model and train only A and B, you're not approximating the full update: you're learning a structured perturbation constrained to a low-rank subspace. Empirically, task-specific weight changes do occupy a low-rank manifold: most of the variation can be captured in 4-8 dimensions per layer (versus 4096+ in a typical dense matrix). The frozen base model's knowledge remains intact, and the adapter learns only the delta. This is why quality holds: the adapter isn't trying to learn the whole task from scratch, just the difference between the pretrained and fine-tuned behavior.
When to use it: Use LoRA when you're fine-tuning for a specific task (classification, instruction-following, domain adaptation) and memory or inference latency matter. It's the default choice for most production fine-tuning today because it's fast, parameter-efficient, and composable: you can swap different LoRA weights without reloading the base model.
Analogy
LoRA is like keeping a large, detailed map (the base model) and then applying small, focused corrections (the low-rank adapters) in specific regions. You're not redrawing the entire map: you're noting 'this valley is now a lake, this forest is now a city' with minimal ink. The corrections are structured, not scattered randomly.
Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
model_name = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f'Base model parameters: {sum(p.numel() for p in model.parameters()):,}')
lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=['c_attn', 'c_proj'],
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'
)
model_with_lora = get_peft_model(model, lora_config)
print(f'Model with LoRA parameters: {sum(p.numel() for p in model_with_lora.parameters()):,}')
print(f'Trainable parameters: {sum(p.numel() for p in model_with_lora.parameters() if p.requires_grad):,}')
trainable_fraction = sum(p.numel() for p in model_with_lora.parameters() if p.requires_grad) / sum(p.numel() for p in model.parameters())
print(f'Trainable fraction: {trainable_fraction:.2%}')
print('\nLoRA weight shapes:')
for name, param in model_with_lora.named_parameters():
if 'lora' in name and param.requires_grad:
print(f' {name}: {param.shape}') Base model parameters: 124,439,808 Model with LoRA parameters: 124,651,520 Trainable parameters: 211,712 Trainable fraction: 0.17% LoRA weight shapes: transformer.h.0.attn.c_attn.lora_A.weight: torch.Size([8, 768]) transformer.h.0.attn.c_attn.lora_B.weight: torch.Size([768, 8]) transformer.h.0.attn.c_proj.lora_A.weight: torch.Size([8, 768]) transformer.h.0.attn.c_proj.lora_B.weight: torch.Size([768, 8]) transformer.h.1.attn.c_attn.lora_A.weight: torch.Size([8, 768]) transformer.h.1.attn.c_attn.lora_B.weight: torch.Size([768, 8]) transformer.h.1.attn.c_proj.lora_A.weight: torch.Size([8, 768]) transformer.h.1.attn.c_proj.lora_B.weight: torch.Size([768, 8]) transformer.h.2.attn.c_attn.lora_A.weight: torch.Size([8, 768]) transformer.h.2.attn.c_attn.lora_B.weight: torch.Size([768, 8]) transformer.h.2.attn.c_proj.lora_A.weight: torch.Size([8, 768]) transformer.h.2.attn.c_proj.lora_B.weight: torch.Size([768, 8])
What just happened?
The code loaded GPT-2 and applied LoRA to its attention modules. The base model has 124M parameters, but only 211K are trainable (0.17%). Those 211K parameters are distributed as low-rank pairs: an 8×768 matrix (<code>lora_A</code>) and a 768×8 matrix (<code>lora_B</code>) inserted into each attention head. During training, only these matrices receive gradients. The frozen base model contributes zero trainable parameters but provides all the pretrained knowledge.
Common gotcha
Developers often set r (rank) too high thinking 'more rank = more capacity = better quality.' Wrong. If r=768 (the full dimension), you're back to training the full matrix and you've gained nothing except complexity. The sweet spot is r=4 to r=16 for most tasks: empirically verified across GPT, LLaMA, and Mistral. Going higher than 32 almost never helps and wastes memory. Start at r=8, measure quality, and lower if memory is tight.
Error recovery
ValueError: LoRA not supported for this module typeRuntimeError: Expected 2D tensor for LoRA weight matrixCUDA out of memory even with LoRAExperienced dev note
The reason LoRA works isn't mathematical magic: it's empirical. Researchers found that the weight change from pretraining to fine-tuning, for a given task, lives in a low-rank subspace. You're not approximating or sacrificing; you're exploiting task-specific structure. This has a production implication: you can stack multiple LoRA adapters on the same base model and switch them at inference with ~no overhead. Want a model that does summarization, translation, and Q&A without reloading? Stack three LoRAs, swap weights per request. Full fine-tuning doesn't let you do this: LoRA fundamentally changes how you can architect multi-task systems.
Check your understanding
If you set r=2 for a GPT-2 LoRA adapter on query and value projections, what fraction of GPT-2's 124M parameters become trainable, and why does the model still learn meaningful task behavior if the low-rank bottleneck is so tight?
Show answer hint
Calculate: For each projection, you add <code>2 × 768 + 768 × 2 = 3,072</code> parameters per head per layer. With ~12 layers and 12 heads, you're looking at ~0.03% trainable. The model learns because it's not learning the task from scratch: the 124M frozen parameters already encode general language. LoRA only needs to capture the task-specific delta, which is genuinely low-rank. Empirically, even <code>r=4</code> often matches <code>r=8</code> in downstream quality.