How-to · Intermediate · 4 min read

LoRA for instruction following

Quick answer
Use LoRA (Low-Rank Adaptation) to fine-tune large language models for instruction following efficiently: inject trainable low-rank matrices into selected layers while the base weights stay frozen. This adapts the model with a small fraction of the trainable parameters and far less compute than full fine-tuning. Combine LoRA with an instruction-tuning dataset to improve task-specific responses.
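Under the hood, the update is just a low-rank factorization: instead of learning a full weight delta ΔW, LoRA learns a down-projection A and an up-projection B and adds (alpha/r)·BA to the frozen weight. A minimal NumPy sketch (the 1024-dim shape is illustrative; r=16 and alpha=32 match the config used below):

```python
import numpy as np

# Illustrative shapes: a 1024x1024 projection with rank-16 adapters
d, r, alpha = 1024, 16, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

# Effective weight during fine-tuning: W + (alpha / r) * B @ A
# Zero-initializing B means training starts exactly at the pretrained model
W_adapted = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.2%}")  # → 3.12%
```

Only A and B receive gradients, which is why the trainable-parameter count stays tiny even for billion-parameter models.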

PREREQUISITES

  • Python 3.8+
  • pip install "transformers>=4.30.0"
  • pip install peft
  • pip install datasets
  • Access to a pretrained LLM checkpoint (e.g., meta-llama/Llama-3.1-8B-Instruct)

Setup

Install the required Python packages for LoRA fine-tuning and dataset handling, and make sure you can download a pretrained instruction-tuned base checkpoint. The Llama checkpoints are gated on the Hugging Face Hub, so accept the license and authenticate (e.g., with huggingface-cli login) before loading them.

bash
pip install transformers peft datasets

Step by step

This example shows how to apply LoRA fine-tuning on an instruction-following LLM using the peft library with Hugging Face transformers. It loads a pretrained instruction-tuned model, applies LoRA adapters, fine-tunes on a sample instruction dataset, and saves the adapted model.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import torch

# Load pretrained instruction-tuned model and tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA to the model; base weights stay frozen, only adapters train
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load a small instruction-following dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1%]")

# Llama tokenizers ship without a pad token; reuse EOS so padding works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize: join the prompt fields, pad to a fixed length, and mirror
# input_ids as labels so the causal-LM loss can be computed
max_length = 512
def tokenize_fn(example):
    text = example["instruction"] + "\n" + example["input"] + "\n" + example["output"] + tokenizer.eos_token
    tokens = tokenizer(text, truncation=True, max_length=max_length, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Prepare dataset and drop the raw text columns
dataset = dataset.map(tokenize_fn, batched=False, remove_columns=dataset.column_names)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-instruction",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=100,
    save_total_limit=1,
    fp16=True,
    optim="adamw_torch"
)

# Trainer with a collator that pads batches and builds causal-LM labels
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# Fine-tune
trainer.train()

# Save LoRA adapters
model.save_pretrained("./lora-instruction")
output
***** Running training *****
  Num examples = 500
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 125
...
Saving model checkpoint to ./lora-instruction

Common variations

  • Use load_in_4bit=True with BitsAndBytesConfig for memory-efficient LoRA fine-tuning (QLoRA).
  • Apply LoRA to different base models like meta-llama/Llama-3.3-70B-Instruct for larger scale.
  • Use accelerate or distributed training for faster fine-tuning on multiple GPUs.
  • Streamline inference by loading only LoRA adapters on top of the frozen base model.
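The QLoRA variation above can be sketched as a quantization config passed at load time (a sketch, assuming bitsandbytes is installed and a CUDA GPU is available; the NF4 settings shown are common defaults, not the only option):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
import torch

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for training (casts norms, enables input grads)
model = prepare_model_for_kbit_training(model)
# ...then apply get_peft_model(model, lora_config) as in the main example
```

Only the LoRA adapters train in full precision; the 4-bit base stays frozen, which is what makes fine-tuning 70B-class models feasible on a single GPU.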

Troubleshooting

  • If you get CUDA out-of-memory errors, reduce batch size or use 4-bit quantization with QLoRA.
  • Ensure target_modules in LoraConfig match your model architecture (e.g., q_proj, v_proj for LLaMA).
  • Verify tokenizer and model versions match to avoid tokenization errors.
  • Use torch_dtype=torch.float16 or bf16 for mixed precision to save memory.
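To check which target_modules names your architecture actually exposes, you can scan model.named_modules(). A toy sketch (ToyAttention is a hypothetical stand-in; running the same scan on a real LLaMA model surfaces q_proj, k_proj, v_proj, o_proj plus the MLP's gate_proj/up_proj/down_proj):

```python
import torch.nn as nn

# Toy stand-in for one attention block; real LLaMA layers expose more projections
class ToyAttention(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

model = ToyAttention()
# Collect candidate module names to pass as LoraConfig(target_modules=...)
targets = sorted({name.split(".")[-1] for name, mod in model.named_modules()
                  if isinstance(mod, nn.Linear)})
print(targets)  # → ['q_proj', 'v_proj']
```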

Key Takeaways

  • LoRA fine-tuning injects trainable low-rank matrices to adapt LLMs efficiently for instruction following.
  • Use the peft library with Hugging Face transformers for easy LoRA integration.
  • QLoRA combines LoRA with 4-bit quantization for memory-efficient fine-tuning on large models.
  • Match target_modules in LoRA config to your model’s architecture for best results.
  • Fine-tune on instruction datasets like Alpaca or OpenAssistant to improve instruction following.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.3-70B-Instruct