How does LoRA work?
LoRA is like adding small, adjustable lenses to a large camera instead of rebuilding the entire camera to change how it captures images — it tweaks the output efficiently without overhauling the whole system.
The core mechanism
LoRA works by freezing the original pretrained weights and learning only small low-rank matrices that approximate the weight update needed for a new task. Instead of updating the full weight matrix W of size d × k, LoRA learns two smaller matrices B and A of sizes d × r and r × k respectively, where r is much smaller than both d and k. The adapted weight becomes W + BA, where BA is a low-rank update.
This reduces the number of trainable parameters from d × k to r(d + k), typically a small fraction of the original count, making fine-tuning faster and more memory-efficient.
Step by step
1. Start with a pretrained model with frozen weights W.
2. Initialize two small trainable matrices B and A with rank r (e.g., 4 or 8); B starts at zero so the initial update BA is zero and training begins from the pretrained behavior.
3. During training, only A and B are updated; W remains fixed.
4. The effective weight used in forward passes is W + BA.
5. After training, save only A and B as the fine-tuned parameters.
| Step | Action | Details |
|---|---|---|
| 1 | Freeze original weights | Keep pretrained weights W fixed |
| 2 | Initialize low-rank matrices | Create A and B with rank r |
| 3 | Train only A and B | Update these small matrices during fine-tuning |
| 4 | Compute adapted weights | Use W + BA for forward passes |
| 5 | Save low-rank updates | Store only A and B for deployment |
Concrete example
Suppose a weight matrix W is 1024 × 1024 (over 1 million parameters). Using LoRA with rank r = 8, you train two matrices: B (1024 × 8 = 8,192 params) and A (8 × 1024 = 8,192 params), totaling 16,384 trainable parameters, less than 2% of the original.
This drastically reduces memory and compute during fine-tuning.
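The arithmetic above is easy to check directly; a quick sketch in plain Python (no framework needed):

```python
d, k, r = 1024, 1024, 8

full = d * k        # parameters in the full weight matrix
lora = r * (d + k)  # parameters in the two low-rank matrices

print(full)                  # 1048576
print(lora)                  # 16384
print(f"{lora / full:.2%}")  # 1.56%
```

At rank 8 the trainable parameters are about 1.6% of the full matrix, and the ratio shrinks further as d and k grow.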
```python
import torch
from torch import nn

class LoRALayer(nn.Module):
    def __init__(self, original_weight, rank=8):
        super().__init__()
        d, k = original_weight.shape
        # Frozen pretrained weight: registered as a buffer so it is saved
        # with the module but never updated by the optimizer.
        self.register_buffer("W", original_weight)
        # Low-rank factors: B starts at zero so the initial update BA is
        # zero and training begins from the pretrained behavior.
        self.B = nn.Parameter(torch.zeros(d, rank))
        self.A = nn.Parameter(torch.empty(rank, k))
        nn.init.kaiming_uniform_(self.A, a=5**0.5)

    def forward(self, x):
        # x @ (W + BA): base output plus the low-rank correction
        return x @ (self.W + self.B @ self.A)

# Example usage
original_W = torch.randn(1024, 1024)
lora_layer = LoRALayer(original_W, rank=8)
input_tensor = torch.randn(1, 1024)
output = lora_layer(input_tensor)
print(output.shape)  # torch.Size([1, 1024])
```
Common misconceptions
A common misconception is that LoRA fine-tunes all of the model's weights. In fact it trains only the small low-rank matrices and keeps the original weights frozen, so it is a parameter-efficient tuning method rather than full fine-tuning.
Another misconception is that LoRA reduces model capacity; in reality, it preserves the pretrained knowledge and adapts it with minimal overhead.
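The freezing claim is easy to verify: count which tensors actually require gradients. A minimal sketch (standalone tensors here rather than a full module, for illustration):

```python
import torch
from torch import nn

d, k, r = 1024, 1024, 8

W = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen base weight
B = nn.Parameter(torch.zeros(d, r))                       # trainable
A = nn.Parameter(torch.randn(r, k) * 0.01)                # trainable

# Only B and A contribute trainable parameters; W contributes nothing.
trainable = sum(p.numel() for p in (W, B, A) if p.requires_grad)
print(trainable)  # 16384
```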
Why it matters for building AI apps
LoRA enables developers to fine-tune large models on limited hardware by drastically reducing trainable parameters and memory use. This lowers costs and speeds up experimentation, making it practical to customize models for specific tasks or domains without expensive full fine-tuning.
It also simplifies deployment since only small low-rank matrices need to be stored and loaded alongside the frozen base model.
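One deployment consequence worth noting: because the update BA is just a matrix of the same shape as W, it can be merged into the base weight once after training, so inference adds no extra matmuls. A minimal sketch of the merge step, using random stand-ins for the trained factors:

```python
import torch

d, k, r = 1024, 1024, 8
W = torch.randn(d, k)          # frozen base weight
B = torch.randn(d, r) * 0.01   # trained low-rank factors (stand-ins here)
A = torch.randn(r, k) * 0.01

# Merge once, then serve W_merged exactly like an ordinary weight matrix.
W_merged = W + B @ A

x = torch.randn(1, d)
# The merged path matches applying the base weight and low-rank update separately.
assert torch.allclose(x @ W_merged, x @ W + (x @ B) @ A, atol=1e-4)
```

Keeping B and A separate instead allows many task-specific adapters to share one base model; merging trades that flexibility for zero inference overhead.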
Key Takeaways
- LoRA fine-tunes large models by training small low-rank matrices added to frozen weights.
- It reduces trainable parameters by over 90%, enabling efficient adaptation on limited hardware.
- Only the low-rank matrices are saved and deployed, simplifying model updates and storage.