How does LoRA work?
LoRA is like adding small, adjustable lenses to a large camera instead of rebuilding the entire camera to change how it captures images — it tweaks the output efficiently without overhauling the whole system.
The core mechanism
LoRA works by freezing the original pretrained weights and learning only small low-rank matrices that approximate the weight update needed for a new task. Instead of updating the full weight matrix W of size d × k, LoRA learns two smaller matrices B and A of sizes d × r and r × k respectively, where r is much smaller than both d and k. The adapted weight becomes W + BA, where BA is a low-rank update.
This reduces the number of trainable parameters from d × k to r(d + k), typically a small fraction of the original count, making fine-tuning faster and more memory-efficient.
Step by step
1. Start with a pretrained model with frozen weights W.
2. Initialize two small trainable matrices B and A with rank r (e.g., 4 or 8); B starts at zero so the initial update BA is zero and training begins from the pretrained behavior.
3. During training, only A and B are updated; W remains fixed.
4. The effective weight used in forward passes is W + BA.
5. After training, save only A and B as the fine-tuned parameters.
| Step | Action | Details |
|---|---|---|
| 1 | Freeze original weights | Keep pretrained weights W fixed |
| 2 | Initialize low-rank matrices | Create A and B with rank r |
| 3 | Train only A and B | Update these small matrices during fine-tuning |
| 4 | Compute adapted weights | Use W + BA for forward passes |
| 5 | Save low-rank updates | Store only A and B for deployment |
Concrete example
Suppose a weight matrix W is 1024 × 1024 (over 1 million parameters). Using LoRA with rank r = 8, you train two matrices: B (1024 × 8 = 8,192 params) and A (8 × 1024 = 8,192 params), totaling 16,384 trainable parameters, less than 2% of the original.
This drastically reduces memory and compute during fine-tuning.
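The arithmetic above is easy to check directly; a quick sketch in plain Python (no framework needed):

```python
d, k, r = 1024, 1024, 8

full = d * k        # parameters in the full weight matrix
lora = r * (d + k)  # parameters in the two low-rank matrices

print(full)                  # 1048576
print(lora)                  # 16384
print(f"{lora / full:.2%}")  # 1.56%
```

At rank 8 the trainable parameters are about 1.6% of the full matrix, and the ratio shrinks further as d and k grow.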
```python
import torch
from torch import nn

class LoRALayer(nn.Module):
    def __init__(self, original_weight, rank=8):
        super().__init__()
        d, k = original_weight.shape
        # Frozen pretrained weight: registered as a buffer so it is saved
        # with the module but never updated by the optimizer.
        self.register_buffer("W", original_weight)
        # Low-rank factors: B starts at zero so the initial update BA is
        # zero and training begins from the pretrained behavior.
        self.B = nn.Parameter(torch.zeros(d, rank))
        self.A = nn.Parameter(torch.empty(rank, k))
        nn.init.kaiming_uniform_(self.A, a=5**0.5)

    def forward(self, x):
        # x @ (W + BA): base output plus the low-rank correction
        return x @ (self.W + self.B @ self.A)

# Example usage
original_W = torch.randn(1024, 1024)
lora_layer = LoRALayer(original_W, rank=8)
input_tensor = torch.randn(1, 1024)
output = lora_layer(input_tensor)
print(output.shape)  # torch.Size([1, 1024])
```
Common misconceptions
A common misconception is that LoRA fine-tunes all of the model's weights. In fact it trains only the small low-rank matrices and keeps the original weights frozen, so it is a parameter-efficient tuning method rather than full fine-tuning.
Another misconception is that LoRA reduces model capacity; in reality, it preserves the pretrained knowledge and adapts it with minimal overhead.
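The freezing claim is easy to verify: count which tensors actually require gradients. A minimal sketch (standalone tensors here rather than a full module, for illustration):

```python
import torch
from torch import nn

d, k, r = 1024, 1024, 8

W = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen base weight
B = nn.Parameter(torch.zeros(d, r))                       # trainable
A = nn.Parameter(torch.randn(r, k) * 0.01)                # trainable

# Only B and A contribute trainable parameters; W contributes nothing.
trainable = sum(p.numel() for p in (W, B, A) if p.requires_grad)
print(trainable)  # 16384
```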
Why it matters for building AI apps
LoRA enables developers to fine-tune large models on limited hardware by drastically reducing trainable parameters and memory use. This lowers costs and speeds up experimentation, making it practical to customize models for specific tasks or domains without expensive full fine-tuning.
It also simplifies deployment since only small low-rank matrices need to be stored and loaded alongside the frozen base model.
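One deployment consequence worth noting: because the update BA is just a matrix of the same shape as W, it can be merged into the base weight once after training, so inference adds no extra matmuls. A minimal sketch of the merge step, using random stand-ins for the trained factors:

```python
import torch

d, k, r = 1024, 1024, 8
W = torch.randn(d, k)          # frozen base weight
B = torch.randn(d, r) * 0.01   # trained low-rank factors (stand-ins here)
A = torch.randn(r, k) * 0.01

# Merge once, then serve W_merged exactly like an ordinary weight matrix.
W_merged = W + B @ A

x = torch.randn(1, d)
# The merged path matches applying the base weight and low-rank update separately.
assert torch.allclose(x @ W_merged, x @ W + (x @ B) @ A, atol=1e-4)
```

Keeping B and A separate instead allows many task-specific adapters to share one base model; merging trades that flexibility for zero inference overhead.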
Key Takeaways
- LoRA fine-tunes large models by training small low-rank matrices added to frozen weights.
- It reduces trainable parameters by over 90%, enabling efficient adaptation on limited hardware.
- Only the low-rank matrices are saved and deployed, simplifying model updates and storage.