Workflow Intermediate medium · 8 min decision_step

Rank r: the key parameter

What you will learn

Rank r controls how many trainable parameters LoRA adds to your model: too low loses expressiveness, too high defeats the compression goal.

Step 2: Configure LoRA Adapter (immediately after choosing your base model and before instantiating LoraConfig)

Why this matters

Rank r is the main dial for the adapter's size, and it influences memory footprint, training speed, and whether the fine-tune can capture task-specific patterns. A rank that is too low can leave the adapter unable to fit your training data; a rank that is too high erodes the parameter savings that are the point of LoRA, since adapter parameters, gradients, and optimizer state all grow linearly with r.

Explanation

What rank r means: LoRA adds a pair of low-rank matrices to each targeted linear layer. Rank r is the inner dimension of that pair. With r=8, a weight matrix W of shape d_out × d_in gets an A (8 × d_in) and a B (d_out × 8), and the product BA approximates the weight delta. Higher rank means more expressiveness, but also more parameters and memory.

How to choose: Start with r=8 for most tasks; increase to 16 or 32 if the task is complex (classification, domain shift) or your model is very large (70B+). For small models or simple tasks (instruction-following on a similar domain), r=4 often suffices. Rule of thumb: total LoRA parameters should land at roughly 0.05–1% of the model's full parameter count.

What to watch: Rank interacts with alpha (the scaling factor); the adapter update is scaled by alpha/r. If you set alpha=2*r (the common heuristic) and training loss still plateaus or oscillates, that can be a sign that rank is insufficient, although learning rate and data quality can produce the same symptoms.
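To make the parameter arithmetic concrete, here is a minimal sketch in plain Python. It assumes Llama-2-7B-like dimensions (32 decoder layers, hidden size 4096, square q_proj and v_proj projections). The helper name lora_param_count is hypothetical and not part of PEFT.

```python
def lora_param_count(r, layer_shapes, num_layers):
    """Count the parameters LoRA adds for a given rank.

    layer_shapes: list of (d_in, d_out), one entry per adapted module in a layer.
    Each adapted module adds two matrices with r * d_in and r * d_out entries.
    """
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers


# Assumed dimensions: q_proj and v_proj both map 4096 -> 4096, across 32 layers.
shapes = [(4096, 4096), (4096, 4096)]
for r in (4, 8, 16):
    print(f"r={r}: {lora_param_count(r, shapes, num_layers=32):,} LoRA parameters")
```

The count grows linearly with r, so doubling the rank doubles the adapter size regardless of model width.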

Code

python

# pip install peft bitsandbytes torch transformers
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'meta-llama/Llama-2-7b-hf'
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map='auto'
)

ranks_to_test = [4, 8, 16]
configs = {}

for r in ranks_to_test:
    lora_config = LoraConfig(
        r=r,
        lora_alpha=2 * r,
        target_modules=['q_proj', 'v_proj'],
        lora_dropout=0.05,
        bias='none',
        task_type='CAUSAL_LM'
    )
    configs[f'rank_{r}'] = lora_config
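    peft_model = get_peft_model(model, lora_config)

    # Count adapter (trainable) parameters against everything in the wrapped model.
    # 4-bit weights are stored packed, so total_params is below the nominal 7B.
    trainable_params = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in peft_model.parameters())
    percent = 100.0 * trainable_params / total_params

    print(f'Rank {r}: {trainable_params:,} trainable params / {total_params:,} total ({percent:.3f}%)')

    # get_peft_model injects LoRA layers into `model` in place, so strip them
    # before trying the next rank; otherwise the counts are not independent.
    model = peft_model.unload()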

Output

Rank 4: 2,097,152 trainable params / 3,274,190,848 total (0.064%)
Rank 8: 4,194,304 trainable params / 3,274,190,848 total (0.128%)
Rank 16: 8,388,608 trainable params / 3,274,190,848 total (0.256%)

Exact totals depend on your transformers, bitsandbytes, and peft versions. The total is well below 7B because 4-bit weights are stored packed, so the percentage is measured against the packed count, not the nominal model size.

Your options

Recommended

r=4

Small models (7B), very similar domain (e.g., fine-tuning a code LLM on more code), or when VRAM is critical and you accept some accuracy trade-off.

Pros

Minimal memory overhead (well under 0.1% of parameters added for a 7B model with q_proj and v_proj targeted). Fast training. Inference latency nearly unchanged.

Cons

May underfit on diverse tasks. High-dimensional semantic shifts (language to code) might compress poorly into a rank-4 subspace.

from peft import LoraConfig
lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)

r=8

Default for most tasks. Medium models (13B–30B), standard fine-tuning (instruction-tuning, domain adaptation within same modality).

Pros

Empirically best balance for typical fine-tuning. Widely benchmarked. Usually sufficient for 1–10% training data.

Cons

Not universally optimal: some very large models (70B) or very different tasks (vision→language) may need higher rank.

from peft import LoraConfig
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)

r=16 or r=32

Large models (70B+), complex tasks (e.g., code-to-math reasoning, cross-modal alignment), or cases where a baseline r=8 run plateaus on validation.

Pros

Captures more task-specific structure. Better for out-of-distribution generalization.

Cons

Higher memory (4x more adapter params at r=32 vs r=8 for the same target_modules; the example below also adds k_proj, which increases the count further). Training slower. Risk of overfitting on small datasets.

from peft import LoraConfig
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=['q_proj', 'v_proj', 'k_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)

Validation step

Concrete checks before proceeding:

(1) After wrapping the model with your LoraConfig, compute the trainable parameter count (shown in the code above) and verify it falls within roughly 0.05–1% of the base model's parameters. On a 4-bit model the reported total is the packed count, so the percentage reads higher than it would against the nominal model size.

(2) Before training, print the config via print(lora_config) and confirm that r matches your intended value and that alpha = 2*r (or your intentional override).

(3) Confirm that the trainable parameters in the q_proj and v_proj layers are the low-rank pair, with shapes (r, d_in) for lora_A and (d_out, r) for lora_B, and not full-rank weight matrices. You can check this immediately after get_peft_model; no training steps are required.
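The shape check can be automated with a small helper. This is a sketch under stated assumptions: PEFT stores lora_A.weight as (r, d_in) and lora_B.weight as (d_out, r), and parameter names contain 'lora_A' or 'lora_B'. The function check_lora_shapes and the example names below are hypothetical; it works on plain (name, shape) pairs so it can be tried without loading a model.

```python
def check_lora_shapes(named_shapes, r):
    """Return the names of LoRA tensors whose rank dimension is not r.

    named_shapes: iterable of (parameter_name, shape) pairs, for example
    [(n, tuple(p.shape)) for n, p in peft_model.named_parameters()].
    """
    mismatched = []
    for name, shape in named_shapes:
        if "lora_A" in name and shape[0] != r:    # lora_A.weight: (r, d_in)
            mismatched.append(name)
        elif "lora_B" in name and shape[1] != r:  # lora_B.weight: (d_out, r)
            mismatched.append(name)
    return mismatched


# Hypothetical parameter list for one attention layer adapted with r=8.
example = [
    ("layers.0.self_attn.q_proj.lora_A.default.weight", (8, 4096)),
    ("layers.0.self_attn.q_proj.lora_B.default.weight", (4096, 8)),
    ("layers.0.self_attn.v_proj.lora_A.default.weight", (8, 4096)),
    ("layers.0.self_attn.v_proj.lora_B.default.weight", (4096, 8)),
]
print(check_lora_shapes(example, r=8))  # an empty list means every adapter has rank 8
```

Tensors without 'lora_A' or 'lora_B' in their names are ignored, so the frozen base weights never trigger a mismatch.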

At scale

At production scale (100B+ model, distributed training), rank also interacts with your parallelism and batching setup. Ranks 16–32 can become memory-bound on A100s when batch size > 4 and seq_len > 2048. For 70B+ models, the percentage heuristic shifts: practitioners often use r=64–128 because the expressiveness ceiling is higher and memory is distributed across GPUs. Rank should also scale with model width: for hidden_dim=4096 (7B), r=8 is ~0.2% of the hidden dimension; for hidden_dim=8192 (70B), the same r=8 is only ~0.1%, which may be insufficient.
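The width comparison can be reproduced with a few lines of arithmetic. This sketch reads the percentages as r divided by hidden_dim, which is an interpretation and not something the text states outright. Both helper names are hypothetical.

```python
def rank_share(r, hidden_dim):
    """Rank expressed as a percentage of the hidden dimension."""
    return 100.0 * r / hidden_dim


def scaled_rank(r, from_dim, to_dim):
    """Rank that keeps the same r / hidden_dim ratio at a different width."""
    return r * to_dim // from_dim


print(f"r=8 at hidden_dim=4096: {rank_share(8, 4096):.2f}%")
print(f"r=8 at hidden_dim=8192: {rank_share(8, 8192):.2f}%")
print("rank matching r=8 at the wider model:", scaled_rank(8, 4096, 8192))
```

Treat the scaled rank as a starting point for a sweep, not a rule; the right value still depends on the task and data.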

Rollback plan

If training loss plateaus or validation accuracy stalls after 10–20% of training (a likely sign that rank is insufficient): (1) Stop training rather than waiting for the run to complete. (2) Reload your base model fresh. (3) Increase rank by 2x (e.g., 8→16) and restart from epoch 0. (4) Increase lora_alpha by the same factor so the alpha/r ratio is unchanged. Do not resume from the previous checkpoint with a new rank: LoRA adapter shapes are fixed at initialization, so the old adapter weights will not load into the new config.
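One way to avoid forgetting the alpha adjustment during a rollback is to derive the new settings from the old ones. This is a minimal sketch; bump_rank is a hypothetical helper that operates on a plain dict of keyword arguments, not on a LoraConfig object.

```python
def bump_rank(config_kwargs, factor=2):
    """Return a copy of the LoRA kwargs with r and lora_alpha both scaled
    by `factor`, so the alpha/r ratio stays the same."""
    new_kwargs = dict(config_kwargs)
    new_kwargs["r"] = config_kwargs["r"] * factor
    new_kwargs["lora_alpha"] = config_kwargs["lora_alpha"] * factor
    return new_kwargs


old = {
    "r": 8,
    "lora_alpha": 16,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.05,
}
new = bump_rank(old)
print(new["r"], new["lora_alpha"])
```

The resulting dict can be passed as LoraConfig(**new, bias='none', task_type='CAUSAL_LM') and applied to a freshly loaded base model.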

Debug symptoms

Training loss is flat or decreases very slowly; validation loss doesn't improve meaningfully after 20% of training.

Diagnosis

Rank is too low for the task complexity. The LoRA subspace cannot capture the necessary weight deltas.

Fix

Increase rank to r=16 (or 2x your current rank) and retrain from scratch. Check alpha is also scaled proportionally.

Out-of-memory (OOM) error during first training batch, even though you fit the quantized base model.

Diagnosis

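Rank is too high, or you're adapting too many layers. 4-bit quantization saves memory in the base weights, but LoRA adds higher-precision trainable matrices, plus their gradients and optimizer state, and high rank compounds this. Activation memory from batch size and sequence length is often the larger contributor, so check those too.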

Fix

Reduce rank (8→4) or reduce target_modules (e.g., only q_proj and v_proj, drop k_proj). Re-test with a small batch size first.
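Before lowering the rank, it can help to estimate how much memory the adapter itself accounts for. The sketch below is a rough estimate under explicit assumptions: adapter weights in fp32 (4 bytes each) and an Adam-style optimizer that keeps two extra values per parameter. It ignores activations entirely, and adapter_training_bytes is a hypothetical helper.

```python
def adapter_training_bytes(trainable_params, bytes_per_value=4):
    """Rough adapter-side training memory in bytes.

    Counts four copies per parameter: the weight, its gradient, and the
    first and second moment kept by an Adam-style optimizer.
    """
    copies = 4
    return trainable_params * bytes_per_value * copies


# Trainable parameter counts taken from the rank sweep output above.
for r, n in [(4, 2_097_152), (8, 4_194_304), (16, 8_388_608)]:
    mib = adapter_training_bytes(n) / 2**20
    print(f"r={r}: ~{mib:.0f} MiB for adapter weights, gradients, and optimizer state")
```

If this estimate is small compared with the memory you are short by, reducing batch size or sequence length is likely to help more than reducing rank.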

Model overfits severely on training data but generalizes poorly; training loss → 0 but validation loss increases.

Diagnosis

Rank is too high relative to dataset size. With small training sets (<1000 examples) and high rank, the adapter overfits the exact training examples instead of learning generalizable patterns.

Fix

Reduce rank to 4–8 and increase lora_dropout to 0.1–0.2. Also increase regularization in your trainer (weight_decay > 0).

Production upgrade path

Tutorial version: pick r=8, move on. Production version: (1) Run a rank sweep [4, 8, 16] on 10% of your training data for 1–2 epochs, measure validation metric. (2) Plot rank vs. validation_metric and rank vs. training_time. Pick the rank that gives 95% of max accuracy with <70% of max training time. (3) Lock that rank and run full training. (4) Log rank choice to your experiment tracker (e.g., Weights & Biases) alongside final metrics: when your model ages and needs retraining, you have the provenance of your hyperparameter choice.
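The selection rule in step (2) can be written down as a short function. This is a sketch with made-up sweep numbers; pick_rank is a hypothetical helper, and the two thresholds are the 95% and 70% figures from the text.

```python
def pick_rank(sweep, min_metric_frac=0.95, max_time_frac=0.70):
    """Choose a rank from sweep results.

    sweep maps rank -> (validation_metric, training_time).
    Returns the best-scoring rank that reaches min_metric_frac of the best
    metric while staying under max_time_frac of the longest run, or None
    if no rank satisfies both conditions.
    """
    best_metric = max(metric for metric, _ in sweep.values())
    longest_run = max(time for _, time in sweep.values())
    eligible = [
        rank
        for rank, (metric, time) in sweep.items()
        if metric >= min_metric_frac * best_metric and time < max_time_frac * longest_run
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda rank: sweep[rank][0])


# Hypothetical results: rank -> (validation accuracy, training minutes).
sweep = {4: (0.88, 50), 8: (0.93, 60), 16: (0.94, 100)}
print(pick_rank(sweep))
```

When the function returns None, no rank meets both thresholds, and the trade-off between accuracy and training time has to be made by hand.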

Common gotcha

Setting alpha independently of rank without understanding the interaction. The adapter update is scaled by lora_alpha/r, so the common heuristic lora_alpha=2*r keeps that scale, and with it the effective step size of the adapter, stable across ranks. If you set r=8 but carry over alpha=8 from a previous config (meant for r=4), your effective scaling is halved: the model trains more slowly than expected, loss curves look muted, and you may wrongly conclude the task is too hard. Always verify the alpha/r ratio in your config before training.
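The halving is easy to verify numerically. This sketch assumes standard LoRA scaling of lora_alpha / r; variants such as rank-stabilized LoRA scale differently. The helper lora_scale is hypothetical.

```python
def lora_scale(r, lora_alpha):
    """Multiplier applied to the adapter update under standard LoRA scaling."""
    return lora_alpha / r


intended = lora_scale(r=4, lora_alpha=8)   # previous config
forgotten = lora_scale(r=8, lora_alpha=8)  # rank raised, alpha left unchanged
print(intended, forgotten, forgotten / intended)
```

A quick assertion on this ratio in your training script catches the mistake before any GPU time is spent.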

Experienced dev note

In production, don't pick rank in isolation; it's entangled with three other choices. (1) target_modules: adapting only q_proj+v_proj vs. all linear layers changes effective capacity; the more modules you adapt, the lower the rank each one typically needs. (2) lora_dropout: higher dropout (0.1+) acts as regularization, so you can use a higher rank with less risk of overfitting. (3) Training data size: rank scales with data. For 100-example few-shot, r=4 is often enough; for 100k-example domain adaptation, r=16–32 is common. Practitioners often run a quick 500-step sweep across [4, 8, 16] on a validation set before committing to a full run. Also, the alpha=2*r heuristic assumes you're using the default LoRA initialization. If your codebase overrides init_lora_weights, that heuristic may not hold: measure effective gradient scales instead.

Check your understanding

You're fine-tuning a 70B model on 5,000 instruction-response pairs. Your baseline with r=8 reaches 92% validation accuracy but plateaus. You increase to r=16 and get 94%, then to r=32 and get 94.1%: barely better. What does this tell you about whether you should use r=32 in production?

Answer hint

This is not a simple accuracy question. The marginal improvement (0.1 percentage points) doesn't justify doubling the adapter parameters again from r=16 to r=32 (4x relative to r=8). The diminishing returns suggest you've hit a ceiling set by task complexity or data annotation quality, not rank insufficiency. Use r=16 in production and invest your effort elsewhere (better data, longer training, different loss function).
