
How to evaluate a LoRA fine-tuned model

Quick answer
To evaluate a LoRA fine-tuned model, load the base model, apply the LoRA adapter with the PEFT library, and run it on a validation dataset to measure metrics like accuracy, perplexity, or generation quality. Using the same evaluation script with and without the adapter lets you compare performance before and after fine-tuning.

PREREQUISITES

  • Python 3.8+
  • pip install torch transformers peft datasets
  • Access to a LoRA fine-tuned model checkpoint

Setup

Install required Python packages for model loading, LoRA integration, and dataset handling.

bash
pip install torch transformers peft datasets

Step by step

Load the base model and attach the LoRA adapter with PEFT, then run evaluation on a validation dataset to compute metrics like perplexity or accuracy. Padding tokens are masked out of the loss below so they do not skew the perplexity.

python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# Load base model and tokenizer
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16, device_map="auto")

# Load LoRA fine-tuned weights
lora_model_path = os.environ.get("LORA_MODEL_PATH")  # e.g. "./lora-finetuned"
model = PeftModel.from_pretrained(model, lora_model_path)
model.eval()

# Load validation dataset (e.g., wikitext for perplexity)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")

# Tokenize a manageable slice of non-empty texts (wikitext contains blank
# lines, and the full split in one padded batch would exhaust GPU memory)
texts = [t for t in dataset["text"] if t.strip()][:64]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Mask padding positions so they do not contribute to the loss
labels = inputs["input_ids"].clone()
labels[inputs["attention_mask"] == 0] = -100

# Evaluate perplexity
with torch.no_grad():
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    perplexity = torch.exp(loss)

print(f"Perplexity of LoRA fine-tuned model: {perplexity.item():.2f}")
output
Perplexity of LoRA fine-tuned model: 12.34
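The single-batch example above will not scale to a large validation set. A hedged sketch of batch-wise aggregation, assuming you run the same masked-label forward pass once per chunk and collect a (mean loss, token count) pair from each batch; weighting by token count keeps the corpus perplexity unbiased when the last batch is shorter:

```python
import math

def iter_batches(texts, batch_size):
    """Yield successive slices of the text list."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

def corpus_perplexity(batch_stats):
    """Combine per-batch (mean_loss, n_tokens) pairs into one perplexity.

    Token-weighted averaging, so a short final batch is not over-counted.
    """
    total_nll = sum(loss * n for loss, n in batch_stats)
    total_tokens = sum(n for _, n in batch_stats)
    return math.exp(total_nll / total_tokens)

# Synthetic check: batches sharing a mean loss of ln(20) give perplexity 20
stats = [(math.log(20.0), 128), (math.log(20.0), 64)]
print(round(corpus_perplexity(stats), 2))  # 20.0
```

In the evaluation loop, tokenize each chunk from `iter_batches(texts, 8)`, run the forward pass from the step above, and append `(outputs.loss.item(), int(inputs["attention_mask"].sum()))` to the list (the token count is an approximation, since the loss is averaged over shifted label positions).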

Common variations

  • Swap in task-specific datasets, e.g. load_dataset("squad"), to measure QA accuracy.
  • Evaluate generation quality by sampling outputs and comparing with references.
  • Run evaluation asynchronously or with batch processing for large datasets.
  • Use quantized or 4-bit models with LoRA for faster inference.
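For the generation-quality variation, a common lightweight metric is token-level F1 between sampled outputs and reference answers, in the style of the SQuAD evaluation script. A minimal sketch (plain whitespace tokenization is an assumption here; real setups also normalize punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(freq) times
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat", "the cat sat on the mat"), 3))  # 0.667
```

Average token_f1 over all (prediction, reference) pairs, and compare the average with the adapter applied against the base model's score.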

Troubleshooting

  • If you see CUDA out-of-memory errors, reduce batch size or use model quantization.
  • If LoRA weights fail to load, verify that LORA_MODEL_PATH points to the correct checkpoint directory (it should contain adapter_config.json and the adapter weights).
  • Ensure tokenizer and base model versions match to avoid tokenization errors.
  • Check that torch_dtype and device mapping are compatible with your hardware.

Key Takeaways

  • Load the base model and apply LoRA weights with the PEFT library for evaluation.
  • Use standard NLP metrics like perplexity or accuracy on a validation dataset to measure performance.
  • Match tokenizer and model versions to avoid tokenization mismatches during evaluation.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct