How to evaluate a LoRA fine-tuned model
Quick answer
To evaluate a LoRA fine-tuned model, load the base model, apply the LoRA weights with the PEFT library, and run it on a validation dataset to measure metrics like accuracy, perplexity, or generation quality. Run the same evaluation script against the base model with and without the LoRA weights applied to compare performance before and after fine-tuning.
Prerequisites
- Python 3.8+
- `pip install torch transformers peft datasets`
- Access to a LoRA fine-tuned model checkpoint
Setup
Install required Python packages for model loading, LoRA integration, and dataset handling.
```bash
pip install torch transformers peft datasets
```
Step by step
Load the base model and apply the LoRA weights using `peft`, then run evaluation on a validation dataset to compute metrics like perplexity or accuracy.
```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# Load base model and tokenizer
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.float16, device_map="auto"
)

# Load LoRA fine-tuned weights
lora_model_path = os.environ.get("LORA_MODEL_PATH")  # e.g. "./lora-finetuned"
model = PeftModel.from_pretrained(model, lora_model_path)
model.eval()

# Load validation dataset (e.g., wikitext for perplexity)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
texts = [t for t in dataset["text"] if t.strip()][:16]  # drop empty lines; small sample for a quick check

# Tokenize dataset
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Mask padding out of the labels so it doesn't contribute to the loss
labels = inputs["input_ids"].clone()
labels[inputs["attention_mask"] == 0] = -100

# Evaluate perplexity
with torch.no_grad():
    outputs = model(**inputs, labels=labels)
perplexity = torch.exp(outputs.loss)
print(f"Perplexity of LoRA fine-tuned model: {perplexity.item():.2f}")
```
Output
```
Perplexity of LoRA fine-tuned model: 12.34
```
Common variations
- Use different datasets, e.g. `datasets.load_dataset('squad')`, for QA accuracy evaluation.
- Evaluate generation quality by sampling outputs and comparing them with references (see the first sketch below).
- Run evaluation asynchronously or with batch processing for large datasets (see the batching sketch below).
- Use quantized or 4-bit models with LoRA for faster inference (see the 4-bit sketch below).
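For the generation-quality variation, here is a minimal sketch that reuses `model` and `tokenizer` from the main example. The prompt/reference pairs are hypothetical placeholders, and the ROUGE scoring assumes `pip install evaluate rouge_score` on top of the prerequisites above.
```python
import torch
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")

# Hypothetical (prompt, reference) pairs -- replace with your own validation data
eval_pairs = [
    ("Summarize: LoRA adds small low-rank adapter matrices to frozen weights.",
     "LoRA fine-tunes models by training low-rank adapters on frozen weights."),
]

predictions, references = [], []
for prompt, reference in eval_pairs:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
    # Score only the generated continuation, not the echoed prompt
    predictions.append(tokenizer.decode(output_ids[0][input_ids.shape[1]:],
                                        skip_special_tokens=True))
    references.append(reference)

print(rouge.compute(predictions=predictions, references=references))
```
Greedy decoding (`do_sample=False`) keeps scores reproducible across runs; switch to sampling if you care about output diversity rather than point accuracy.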
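For the batch-processing variation, a sketch that accumulates token-level loss across mini-batches instead of tokenizing the whole split at once, again reusing `model`, `tokenizer`, and `dataset` from the main example. The token count is approximate, since Hugging Face shifts labels internally before computing the loss.
```python
import math
import torch

texts = [t for t in dataset["text"] if t.strip()]
batch_size = 8
total_nll, total_tokens = 0.0, 0

for i in range(0, len(texts), batch_size):
    batch = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                      truncation=True, padding=True, max_length=512)
    batch = {k: v.to(model.device) for k, v in batch.items()}
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    with torch.no_grad():
        loss = model(**batch, labels=labels).loss
    n_tokens = int((labels != -100).sum())  # approximate: ignores the label shift
    total_nll += loss.item() * n_tokens
    total_tokens += n_tokens

print(f"Perplexity over {total_tokens} tokens: {math.exp(total_nll / total_tokens):.2f}")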
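And for the 4-bit variation, a sketch assuming a CUDA GPU and `pip install bitsandbytes`: load the quantized base model first, then apply the LoRA adapter on top exactly as before.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# NF4 4-bit quantization with fp16 compute for the matmuls
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, lora_model_path)
model.eval()
```
The evaluation code itself is unchanged; NF4 quantization typically costs a small amount of accuracy in exchange for roughly a 4x reduction in weight memory.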
Troubleshooting
- If you see CUDA out-of-memory errors, reduce batch size or use model quantization.
- If LoRA weights fail to load, verify that `LORA_MODEL_PATH` points to the correct checkpoint.
- Ensure tokenizer and base model versions match to avoid tokenization errors.
- Check that `torch_dtype` and device mapping are compatible with your hardware (see the sketch below).
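For the last point, a quick sketch for picking a `torch_dtype` the hardware actually supports: bfloat16 requires Ampere-or-newer GPUs, while float16 runs on most CUDA devices.
```python
import torch

if not torch.cuda.is_available():
    dtype = torch.float32        # CPU evaluation: stay in full precision
elif torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16       # Ampere (A100, RTX 30xx) and newer
else:
    dtype = torch.float16        # older CUDA GPUs
# Pass the result as torch_dtype=dtype to from_pretrained
print(f"Using torch_dtype={dtype}")
```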
Key takeaways
- Load the base model and apply LoRA weights with the PEFT library for evaluation.
- Use standard NLP metrics like perplexity or accuracy on a validation dataset to measure performance.
- Match tokenizer and model versions to avoid tokenization mismatches during evaluation.