LoraConfig: the standard approach
Why this matters
Fine-tuning 7B+ parameter models on consumer hardware is impractical without a parameter-efficient method such as LoRA: it cuts the number of trainable parameters from billions to a few million while largely preserving quality, which makes adaptation feasible on resource-constrained infrastructure.
Explanation
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes a pre-trained model's weights and injects trainable low-rank matrices into selected linear layers of the transformer. Instead of updating the full weight matrix W (shape d × k) of a linear layer, LoRA adds two small matrices, A (down-projection, r × k) and B (up-projection, d × r), whose product represents the weight update: ΔW = BA, with rank r ≪ min(d, k). During the forward pass the layer computes y = W·x + (lora_alpha / r)·B·A·x, where only A and B receive gradients. B is initialized to zero by default, so the adapted model starts out identical to the base model.

LoraConfig comes from the PEFT library (not from transformers itself) and specifies the rank, the scaling factor lora_alpha, the target modules (which layer types to modify), and the initialization. The config is applied with get_peft_model(), also from PEFT. For a 7B model, full fine-tuning updates all ~7 billion weights (about 14 GB in 16-bit precision), whereas a LoRA adapter typically holds a few million parameters, on the order of tens of megabytes depending on rank and target modules.

When to use it: whenever you're adapting a large pre-trained model to a specific task and hardware or memory is the limiting factor. It has become a common default because it usually preserves model quality while making fine-tuning practical.
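To make the forward-pass formula concrete, here is a minimal pure-Python sketch with tiny made-up matrices. The helper names (matvec, lora_forward) and the toy dimensions are illustrative assumptions and not part of PEFT; real implementations perform the same arithmetic on tensors.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute y = W x + (lora_alpha / r) * B (A x) for one linear layer."""
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank path: A is r x k, B is d x r
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, k, r, lora_alpha = 4, 4, 2, 4       # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]      # zero init: the adapter starts as a no-op
x = [1.0, 2.0, 3.0, 4.0]

y_base = matvec(W, x)
y_start = lora_forward(W, A, B, x, lora_alpha, r)
print(y_start == y_base)               # True: B = 0 means no change yet

B[0][0] = 1.0                          # pretend training moved one entry of B
y_trained = lora_forward(W, A, B, x, lora_alpha, r)
print(y_trained[0] - y_base[0])        # equals scale * (A x)[0]; other outputs unchanged
```

Because B starts at zero, the first comparison holds exactly; once any entry of B moves, only the low-rank path changes the output while W stays untouched.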
Analogy
LoRA is like teaching someone a new skill by adding a small set of habits rather than rewriting their entire knowledge base. The person (the base model) stays the same; you layer on small behavioral tweaks (the low-rank matrices) that combine to produce specialized behavior.
Code
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the frozen base model with 8-bit weights (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
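
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # attention query and value projections
    lora_dropout=0.05,
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM",
)

# Wrap the base model: base weights are frozen, LoRA matrices are trainable
model = get_peft_model(model, lora_config)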
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} | Total: {total_params:,} | % trainable: {trainable_params/total_params*100:.2f}%")
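# Count only the adapter weights; PEFT names them lora_A / lora_B
lora_params = sum(p.numel() for name, p in model.named_parameters() if "lora_" in name)
print(f"LoRA params only: {lora_params}")

Output:
Trainable: 4,194,816 | Total: 3,251,449,856 | % trainable: 0.13%
LoRA params only: 4194816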
What just happened?
We loaded Llama-2-7B with 8-bit weights, wrapped it with a LoraConfig that injects trainable low-rank matrices into the query and value projections (the attention weights adapted in the original LoRA paper), and froze all base model weights. The output confirms that only ~4.2M of the ~3.25B counted parameters are trainable, about 775 times fewer than the total. The printed total depends on how the quantization backend stores and counts weights, so it can differ from the nominal 7B. The model is now ready for fine-tuning in which only the LoRA matrices receive gradient updates.
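The adapter size can be sanity-checked by hand: each targeted linear layer of shape d_out × d_in gains r × (d_in + d_out) parameters. The sketch below assumes 32 decoder layers with 4096 × 4096 q_proj and v_proj matrices, which is my understanding of the Llama-2-7B architecture; lora_param_count is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(num_layers, target_shapes, r):
    """Adapter parameters: each targeted (d_out, d_in) layer adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_out, d_in in target_shapes)
    return num_layers * per_layer

# Assumed shapes: 32 decoder layers, q_proj and v_proj each 4096 x 4096
shapes = [(4096, 4096), (4096, 4096)]
adapter = lora_param_count(num_layers=32, target_shapes=shapes, r=8)

# Parameters in the same projections if they were trained in full
full = 32 * sum(d_out * d_in for d_out, d_in in shapes)

print(f"{adapter:,}")     # 4,194,304
print(f"{full:,}")        # 1,073,741,824
print(full // adapter)    # 256
```

Under these assumptions the formula gives roughly 4.2M adapter parameters, in line with the trainable count reported above, and shows that the targeted projections alone shrink by a factor of 256 at r=8.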
Common gotcha
Developers often forget that LoRA modifies only the specified target_modules. If none of the names you list exist in the model (e.g., 'query' instead of 'q_proj'), PEFT raises a ValueError; if only some of them match, the unmatched layers silently stay frozen and receive no learning signal. Cross-check your model's actual layer names first, for example with print([name for name, _ in model.named_modules()]). Also, the LoRA output is scaled by lora_alpha / r: set it too high and training can destabilize; set it too low and the adapter barely influences predictions.
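A quick pre-flight check along these lines can catch a misspelled target before training starts. This is a sketch: unmatched_targets is a hypothetical helper, the module names are made up in the style of a Llama decoder, and the exact-or-dotted-suffix matching rule reflects my understanding of how PEFT treats a list of target_modules.

```python
def unmatched_targets(module_names, target_modules):
    """Return the targets that match no module name (exact or dotted-suffix match)."""
    missing = []
    for target in target_modules:
        hit = any(name == target or name.endswith("." + target) for name in module_names)
        if not hit:
            missing.append(target)
    return missing

# Hypothetical names; in a real session use
# module_names = [name for name, _ in model.named_modules()]
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
]

print(unmatched_targets(module_names, ["q_proj", "v_proj"]))   # []
print(unmatched_targets(module_names, ["query", "v_proj"]))    # ['query']
```

An empty list means every target matched at least one module; anything else is a name to fix before calling get_peft_model().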
Error recovery
ValueError: target_modules not found in model
CUDA out of memory during backward pass
AttributeError: 'PeftModel' object has no attribute 'save_pretrained'
Experienced dev note
In production, merge the LoRA weights back into the base model before deployment: `model = model.merge_and_unload()`. This removes the extra low-rank matrix multiplications from every forward pass and lets you ship a single model instead of base plus adapter. Don't merge during training, though: keeping the adapter separate lets you train and compare adapters with different ranks against the same frozen base model. If the base model was loaded quantized, consider reloading it in 16-bit precision before merging, since merging into quantized weights can introduce rounding error.
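Why merging costs nothing in accuracy can be seen on a toy example: folding (lora_alpha / r)·BA into W gives the same output as running the two paths separately. The sketch below uses hand-picked 2 × 2 values and hypothetical helpers (matvec, matmul, merge); it illustrates the arithmetic only, not PEFT's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in left]

def merge(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B A as a single matrix."""
    scale = lora_alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight
A = [[0.5, -0.5]]              # r x k with r = 1
B = [[2.0], [1.0]]             # d x r
x = [1.0, 3.0]
lora_alpha, r = 2, 1

# Two-path forward pass (adapter kept separate)
unmerged = [b + (lora_alpha / r) * u
            for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Single-path forward pass (adapter folded into W)
merged = matvec(merge(W, A, B, lora_alpha, r), x)

print(unmerged)   # [3.0, 13.0]
print(merged)     # [3.0, 13.0]
```

After merging, inference needs only one matrix multiplication per layer, which is where the speed gain comes from.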
Check your understanding
You train a model with LoRA r=4 and lora_alpha=16, then raise lora_alpha to 32 before the next epoch while keeping the adapter weights learned so far. What happens to the LoRA contribution to each layer's output, and why is this dangerous?
Answer hint
The LoRA contribution to the output doubles, because the BA product is scaled by lora_alpha / r. The learned low-rank updates suddenly have twice the influence they were optimized for, which can cause training instability or degraded outputs. lora_alpha is a hyperparameter to fix before training, not a dial to turn mid-training or at inference.
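The arithmetic behind the hint, assuming the default lora_alpha / r scaling (lora_scale is a hypothetical helper):

```python
def lora_scale(lora_alpha, r):
    """Default LoRA scaling factor applied to the B A product."""
    return lora_alpha / r

r = 4
before = lora_scale(16, r)
after = lora_scale(32, r)

print(before, after, after / before)   # 4.0 8.0 2.0
```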