How-to · Intermediate · 3 min read

How to use LoRA with Llama

Quick answer
Use the peft library with transformers to apply LoRA to Llama models. Load the base Llama model, configure LoRA with LoraConfig, and wrap the model with get_peft_model for efficient fine-tuning and inference.

PREREQUISITES

  • Python 3.8+
  • pip install transformers peft torch
  • Access to a Llama model checkpoint (e.g., meta-llama/Llama-3.1-8B-Instruct)
  • Basic knowledge of PyTorch

Setup

Install the required Python packages transformers, peft, and torch. Ensure you have access to a Llama model checkpoint compatible with Hugging Face transformers.

bash
pip install transformers peft torch

Step by step

This example shows how to load a Llama model, configure LoRA, and prepare it for fine-tuning or inference.

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base Llama model and tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                  # LoRA rank
    lora_alpha=32,         # LoRA scaling
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA
    lora_dropout=0.05,     # Dropout for LoRA layers
    task_type=TaskType.CAUSAL_LM
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)

# Example inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

output
Hello, how are you? I am fine, thank you.
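After get_peft_model, only the injected LoRA adapter weights are trainable; the base weights stay frozen. PEFT's model.print_trainable_parameters() reports this ratio, or you can compute it for any PyTorch module with a small helper (count_parameters is an illustrative name, not a peft API):

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    """Return (trainable, total) parameter counts for any torch module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

With r=16 on q_proj and v_proj, the trainable fraction of an 8B Llama is typically well under 1% of the total, which is what makes LoRA fine-tuning so memory-efficient.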

Common variations

  • Combine LoRA with 4-bit quantization (QLoRA) via BitsAndBytesConfig(load_in_4bit=True) to reduce memory usage; for training on a quantized model, call peft's prepare_model_for_kbit_training before get_peft_model.
  • Change target_modules in LoraConfig depending on the Llama model architecture.
  • Use torch.compile() or accelerate for optimized training loops.

For example, loading the base model in 4-bit before applying LoRA:

python
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quant_config
)

# Then apply LoRA as before
model = get_peft_model(model, lora_config)

Troubleshooting

  • If you get ModuleNotFoundError, ensure peft and transformers are installed and up to date.
  • For CUDA out-of-memory errors, reduce batch size or use 4-bit quantization.
  • If target_modules do not match model layers, inspect model architecture with print(model) to find correct module names.
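For the last point, a small helper (linear_module_names is an illustrative name, not part of peft) that lists the leaf names of all Linear layers, the usual LoRA targets:

```python
import torch.nn as nn

def linear_module_names(model: nn.Module):
    """Collect the unique leaf names of all Linear layers in a model."""
    names = set()
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            names.add(name.split(".")[-1])
    return sorted(names)
```

For Hugging Face Llama models this typically yields names like q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj, any of which can be passed in target_modules.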

Key Takeaways

  • Use peft with transformers to apply LoRA on Llama models efficiently.
  • Configure LoraConfig with appropriate target_modules for your Llama variant.
  • Combine LoRA with quantization like 4-bit for memory-efficient fine-tuning.
  • On GPUs, load models with device_map="auto" and a half-precision dtype (torch.float16 or torch.bfloat16) to reduce memory use and speed up inference.
  • Check model layer names if LoRA modules do not apply correctly.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct