Rank r: the key parameter
Why this matters
Rank r is the primary dial that determines the adapter's memory footprint, training speed, and whether the fine-tune can actually capture the task-specific patterns. A rank that is too low can leave the adapter unable to fit your training data; a rank that is too high erodes the parameter savings that are the whole point of LoRA, because adapter weights, gradients, and optimizer state all grow linearly with r.
Explanation
What rank r means: LoRA freezes each targeted linear layer's weight matrix W (d_out × d_in) and learns a low-rank update alongside it. Rank r is the inner dimension of that update. With r=8, W gets a pair A (8 × d_in) and B (d_out × 8), and the product BA (d_out × d_in) is the learned weight delta, scaled by alpha/r. Higher rank means more expressiveness, but also more trainable parameters and more memory.
How to choose: Start with r=8 for most tasks; increase to 16 or 32 if the task is complex (e.g., a large domain shift) or your model is very large (70B+). For small models or simple tasks (instruction-following on a similar domain), r=4 often suffices. The rule of thumb: total LoRA params should be 0.1–1% of the model's full parameter count.
What to watch: Rank interacts with alpha (the scaling factor). If you set alpha=2*r (the common heuristic), monitor whether training loss plateaus or oscillates: that can be a sign that rank is insufficient, although learning rate and data quality can produce the same symptom.
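To make the shapes concrete, here is a minimal sketch that counts the parameters a LoRA pair adds. It assumes a Llama-2-7B-style geometry (hidden size 4096, 32 layers, square q_proj and v_proj); those numbers and the helper name `lora_param_count` are illustrative assumptions, not something a library exposes.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters that one LoRA pair adds to a (d_out x d_in) weight."""
    # A has shape (r, d_in) and B has shape (d_out, r), so BA is (d_out, d_in).
    return r * d_in + d_out * r


# Assumed geometry for a Llama-2-7B-style model (illustrative values).
HIDDEN_DIM = 4096
NUM_LAYERS = 32
ADAPTED_PER_LAYER = 2  # q_proj and v_proj

full_matrix = HIDDEN_DIM * HIDDEN_DIM
for r in (4, 8, 16):
    per_matrix = lora_param_count(HIDDEN_DIM, HIDDEN_DIM, r)
    total = per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS
    print(f"r={r}: {per_matrix:,} per matrix (full matrix: {full_matrix:,}), "
          f"{total:,} across the model")
```

The count is linear in r, which is why doubling the rank doubles the adapter's parameters, gradients, and optimizer state.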
Code
# pip install peft bitsandbytes torch transformers
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'meta-llama/Llama-2-7b-hf'

# Load the base model in 4-bit NF4 so it fits on a single GPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map='auto',
)

ranks_to_test = [4, 8, 16]
configs = {}
for r in ranks_to_test:
    lora_config = LoraConfig(
        r=r,
        lora_alpha=2 * r,  # keep alpha/r fixed at 2 across ranks
        target_modules=['q_proj', 'v_proj'],
        lora_dropout=0.05,
        bias='none',
        task_type='CAUSAL_LM',
    )
    configs[f'rank_{r}'] = lora_config
    peft_model = get_peft_model(model, lora_config)

    # Under 4-bit loading, numel() counts packed weights, so total_params
    # understates the true size of the base model.
    trainable_params = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in peft_model.parameters())
    percent = 100.0 * trainable_params / total_params
    print(f'Rank {r}: {trainable_params:,} trainable params / {total_params:,} total ({percent:.3f}%)')

    # get_peft_model injects adapters into the base model in place, so strip
    # them (without merging) before trying the next rank.
    model = peft_model.unload()

Example output:
Rank 4: 2,097,152 trainable params / 3,274,190,848 total (0.064%)
Rank 8: 4,194,304 trainable params / 3,274,190,848 total (0.128%)
Rank 16: 8,388,608 trainable params / 3,274,190,848 total (0.256%)
Your options
r=4
Small models (7B), very similar domain (e.g., fine-tuning a code LLM on more code), or when VRAM is critical and you accept some accuracy trade-off.
Pros
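Minimal memory overhead (well under 0.1% of parameters added for a 7B model when adapting only q_proj and v_proj; see the counts above). Fast training. Inference latency nearly unchanged, and unchanged if you merge the adapter into the base weights.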
Cons
May underfit on diverse tasks. High-dimensional semantic shifts (e.g., natural language to code) might compress poorly into a rank-4 subspace.
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
r=8
Default for most tasks. Medium models (13B–30B), standard fine-tuning (instruction-tuning, domain adaptation within same modality).
Pros
Empirically best balance for typical fine-tuning. Widely benchmarked. Usually sufficient for 1–10% training data.
Cons
Not universally optimal: some very large models (70B) or very different tasks (vision→language) may need higher rank.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
r=16 or r=32
Large models (70B+), complex task (e.g., code-to-math reasoning, cross-modal alignment), or <strong>if baseline r=8 plateaus</strong> during validation.
Pros
Captures more task-specific structure. Better for out-of-distribution generalization.
Cons
Higher adapter memory (4x more LoRA params at r=32 vs r=8, and more again if you also adapt k_proj, as the config below does). Slower training. Risk of overfitting on small datasets.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=['q_proj', 'v_proj', 'k_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
Validation step
<strong>Concrete checks before proceeding:</strong> (1) After instantiating your LoraConfig with your chosen rank, compute the trainable LoRA parameter count (shown in the code above) and compare it with the 0.1–1% rule of thumb; configs that adapt only q_proj and v_proj at low rank can land somewhat below that range. (2) Before training, print the LoRA config via <code>print(lora_config)</code> and confirm r matches your intended value and alpha = 2*r (or your intentional override). (3) After wrapping the model with <code>get_peft_model</code>, check that <code>model.named_parameters()</code> includes LoRA tensors of shape <code>(r, d_in)</code> (lora_A) and <code>(d_out, r)</code> (lora_B) in the q_proj and v_proj layers, rather than trainable full-rank weight matrices.
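Check (3) can be automated with a small helper. The sketch below is library-free: it takes (name, shape) pairs, which you could build from a PEFT model with `[(n, tuple(p.shape)) for n, p in peft_model.named_parameters()]`. It assumes adapter parameter names contain `lora_A` or `lora_B`, as they do in current PEFT releases; the helper name `check_lora_shapes` and the sample names are hypothetical.

```python
from typing import Iterable, List, Tuple


def check_lora_shapes(
    named_shapes: Iterable[Tuple[str, Tuple[int, ...]]], r: int
) -> List[str]:
    """Return a list of problems; an empty list means every adapter has rank r."""
    problems = []
    seen = 0
    for name, shape in named_shapes:
        if "lora_A" in name:
            seen += 1
            # lora_A is expected to be (r, d_in)
            if len(shape) != 2 or shape[0] != r:
                problems.append(f"{name}: expected ({r}, d_in), got {shape}")
        elif "lora_B" in name:
            seen += 1
            # lora_B is expected to be (d_out, r)
            if len(shape) != 2 or shape[1] != r:
                problems.append(f"{name}: expected (d_out, {r}), got {shape}")
    if seen == 0:
        problems.append("no lora_A / lora_B parameters found")
    return problems


# Hand-written sample shapes for illustration (not read from a real model).
sample = [
    ("layers.0.self_attn.q_proj.lora_A.default.weight", (8, 4096)),
    ("layers.0.self_attn.q_proj.lora_B.default.weight", (4096, 8)),
    ("layers.0.self_attn.v_proj.lora_A.default.weight", (8, 4096)),
    ("layers.0.self_attn.v_proj.lora_B.default.weight", (4096, 8)),
]
print(check_lora_shapes(sample, r=8) or "all adapter shapes match r=8")
```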
At scale
At production scale (100B+ models, distributed training), rank also interacts with your parallelism and batch settings. As a rough guide, ranks 16–32 can become memory-bound on A100s when batch size > 4 and seq_len > 2048, though the exact limit depends on which modules you adapt and how the model is sharded. For 70B+ models, the 0.1–1% heuristic shifts: practitioners often use r=64–128 because the expressiveness ceiling is higher and memory is distributed across GPUs. Rank should also scale with model width: for hidden_dim=4096 (7B), r=8 is ~0.2% of the hidden dimension; for hidden_dim=8192 (70B), the same r=8 is only ~0.1%, which may be insufficient.
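The width comparison above is just r divided by hidden_dim. As a rough sketch, the snippet below reproduces those percentages and shows what rank would keep the ratio constant on a wider model. Proportional scaling is an illustrative assumption here, not an established rule, and both helper names are hypothetical.

```python
def rank_fraction(r: int, hidden_dim: int) -> float:
    """Rank as a fraction of the model's hidden dimension."""
    return r / hidden_dim


def rank_for_fraction(hidden_dim: int, target_fraction: float) -> int:
    """Smallest sensible rank that keeps r / hidden_dim near target_fraction."""
    return max(1, round(hidden_dim * target_fraction))


for label, hidden_dim in (("7B", 4096), ("70B", 8192)):
    pct = 100 * rank_fraction(8, hidden_dim)
    print(f"{label}: r=8 is {pct:.2f}% of hidden_dim={hidden_dim}")

# Rank on the wider model that matches r=8 at hidden_dim=4096.
print(rank_for_fraction(8192, rank_fraction(8, 4096)))
```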
Rollback plan
If training loss plateaus or validation accuracy stalls after 10–20% of training (a likely sign that rank is insufficient, once you have ruled out learning rate and data issues): (1) Stop training immediately (don't wait for full completion). (2) Reload your base model fresh. (3) Increase rank by 2x (e.g., 8→16) and restart from epoch 0. (4) Increase lora_alpha proportionally so the alpha/r ratio stays the same; alpha is a scalar, so this costs no extra memory. Do <strong>not</strong> resume from the previous checkpoint with a new rank: LoRA adapter shapes are fixed at initialization.
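Steps (3) and (4) amount to a small transformation of your config. A minimal sketch, assuming you keep your LoraConfig keyword arguments in a plain dict (the helper name `bump_rank` is hypothetical):

```python
def bump_rank(config_kwargs: dict, factor: int = 2) -> dict:
    """Return a copy of the kwargs with r multiplied and alpha/r preserved."""
    ratio = config_kwargs["lora_alpha"] / config_kwargs["r"]
    updated = dict(config_kwargs)  # leave the original dict untouched
    updated["r"] = config_kwargs["r"] * factor
    updated["lora_alpha"] = int(round(ratio * updated["r"]))
    return updated


old_kwargs = {
    "r": 8,
    "lora_alpha": 16,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}
new_kwargs = bump_rank(old_kwargs)
print(new_kwargs["r"], new_kwargs["lora_alpha"])
```

You would then build a fresh adapter with `LoraConfig(**new_kwargs)` on a freshly loaded base model, not on the old checkpoint.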
Debug symptoms
Training loss is flat or decreases very slowly; validation loss doesn't improve meaningfully after 20% of training.
Diagnosis
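Rank is likely too low for the task complexity: the LoRA subspace cannot capture the necessary weight deltas. Rule out a stale alpha/r ratio or a too-small learning rate first, since both produce the same flat curve.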
Fix
Increase rank to r=16 (or 2x your current rank) and retrain from scratch. Check alpha is also scaled proportionally.
Out-of-memory (OOM) error during first training batch, even though you fit the quantized base model.
Diagnosis
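Rank is too high, or you're adapting too many layers. 4-bit quantization shrinks the base weights, but the LoRA matrices, their gradients, and their optimizer state are kept in higher precision, and activations still scale with batch size and sequence length: high rank and many target modules compound this.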
Fix
Reduce rank (8→4) or reduce target_modules (e.g., only q_proj and v_proj, drop k_proj). Re-test with a small batch size first.
Model overfits severely on training data but generalizes poorly; training loss → 0 but validation loss increases.
Diagnosis
Rank is too high relative to dataset size. With small training sets (<1000 examples) and high rank, the adapter overfits the exact training examples instead of learning generalizable patterns.
Fix
Reduce rank to 4–8 and increase lora_dropout to 0.1–0.2. Also increase regularization in your trainer (weight_decay > 0).
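The underfit and overfit patterns above can be sketched as a rough triage function over logged losses. The thresholds are arbitrary illustrative defaults, the function name `diagnose` is hypothetical, and the OOM case is left out because it shows up as an exception, not a curve. Treat the result as a prompt to look closer, not a verdict.

```python
from typing import Sequence


def diagnose(
    train_losses: Sequence[float],
    val_losses: Sequence[float],
    flat_tol: float = 0.02,
    rise_tol: float = 0.05,
) -> str:
    """Classify a run from its loss history (losses assumed positive)."""
    # Relative drop in training loss from first to last logged value.
    train_drop = (train_losses[0] - train_losses[-1]) / train_losses[0]
    # How far validation loss has climbed back above its best value.
    best_val = min(val_losses)
    val_rise = (val_losses[-1] - best_val) / best_val

    if train_drop < flat_tol:
        return "possible_underfit"  # flat training loss: rank may be too low
    if val_rise > rise_tol:
        return "possible_overfit"   # train improves while val gets worse
    return "ok"


# Made-up loss histories for illustration.
print(diagnose([2.00, 1.99, 1.985], [2.10, 2.09, 2.09]))
print(diagnose([2.00, 0.80, 0.10], [2.10, 1.50, 1.90]))
print(diagnose([2.00, 1.20, 0.90], [2.10, 1.40, 1.20]))
```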
Production upgrade path
Tutorial version: pick r=8, move on. Production version: (1) Run a rank sweep [4, 8, 16] on 10% of your training data for 1–2 epochs, measure validation metric. (2) Plot rank vs. validation_metric and rank vs. training_time. Pick the rank that gives 95% of max accuracy with <70% of max training time. (3) Lock that rank and run full training. (4) Log rank choice to your experiment tracker (e.g., Weights & Biases) alongside final metrics: when your model ages and needs retraining, you have the provenance of your hyperparameter choice.
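Step (2) of the production recipe can be written down as a selection rule. The sketch below uses made-up sweep numbers. It also assumes that, among ranks passing both thresholds, you prefer the highest validation metric and break ties toward the smaller rank; the recipe itself does not specify this. The function name `pick_rank` is hypothetical.

```python
from typing import Dict, Optional, Tuple


def pick_rank(
    results: Dict[int, Tuple[float, float]],
    metric_frac: float = 0.95,
    time_frac: float = 0.70,
) -> Optional[int]:
    """results maps rank -> (validation_metric, training_time)."""
    max_metric = max(metric for metric, _ in results.values())
    max_time = max(time for _, time in results.values())
    eligible = [
        (metric, -rank)
        for rank, (metric, time) in results.items()
        if metric >= metric_frac * max_metric and time < time_frac * max_time
    ]
    if not eligible:
        return None  # nothing passes both thresholds; relax one of them
    best_metric, neg_rank = max(eligible)
    return -neg_rank


# Hypothetical sweep: rank -> (validation accuracy, minutes per epoch).
sweep = {4: (0.88, 40.0), 8: (0.92, 55.0), 16: (0.94, 100.0)}
print(pick_rank(sweep))
```

Note that under a strict "<70% of max training time" rule the slowest rank in the sweep can never be selected, so include at least one rank above the range you expect to use.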
Common gotcha
Setting alpha independently of rank without understanding the interaction. LoRA scales the adapter output by lora_alpha / r, so the common heuristic lora_alpha = 2*r keeps that scale fixed at 2 across ranks. If you raise r from 4 to 8 but carry over lora_alpha=8 from the old config, the scale drops from 2 to 1: the adapter's contribution is halved, training looks slower than expected, loss curves look muted, and you may wrongly conclude the task is too hard. Always verify the alpha/r ratio in your config before training.
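The gotcha is easy to catch with a one-line computation. A minimal sketch, assuming the standard LoRA scaling of lora_alpha / r (other schemes, such as rank-stabilized scaling, use a different formula); both helper names are hypothetical.

```python
def lora_scaling(r: int, lora_alpha: float) -> float:
    """Multiplier applied to the adapter output under standard LoRA scaling."""
    return lora_alpha / r


def ratio_matches(r: int, lora_alpha: float, expected: float = 2.0) -> bool:
    """True when alpha/r equals the ratio you intended (2.0 for alpha = 2*r)."""
    return abs(lora_scaling(r, lora_alpha) - expected) < 1e-9


print(lora_scaling(4, 8))   # original config: scale of 2
print(lora_scaling(8, 8))   # rank doubled, alpha forgotten: scale of 1
print(ratio_matches(8, 8))  # flags the stale alpha
print(ratio_matches(8, 16)) # alpha updated along with rank
```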
Experienced dev note
In production, don't pick rank in isolation. It's entangled with three other hyperparameters: (1) target_modules: adapting only q_proj+v_proj vs. all linear layers changes effective capacity; the more modules you adapt, the lower the rank each one typically needs. (2) lora_dropout: higher dropout (0.1+) acts like regularization, so you can use a higher rank with less risk of overfitting. (3) training data size: rank scales with data. For 100-example few-shot, r=4 is often enough; for 100k-example domain adaptation, r=16–32 is standard. Real practitioners often run a quick 500-step sweep across [4, 8, 16] on a validation set before committing to a full run. Also: the alpha=2*r heuristic assumes you're using the default LoRA initialization. If your codebase overrides init_lora_weights in LoraConfig, that heuristic may not hold: measure effective gradient scales instead.
Check your understanding
You're fine-tuning a 70B model on 5,000 instruction-response pairs. Your baseline with r=8 reaches 92% validation accuracy but plateaus. You increase to r=16 and get 94%, then to r=32 and get 94.1%: barely better. What does this tell you about whether you should use r=32 in production?
Show answer hint
This is not a simple accuracy question. The marginal improvement (0.1 points) doesn't justify doubling the adapter parameters relative to r=16 (4x relative to r=8), along with the extra memory and training time. The diminishing returns suggest you've hit the ceiling of task complexity or data annotation quality, not rank insufficiency. Use r=16 in production and invest your effort elsewhere (better data, longer training, different loss function).