
How to evaluate a LoRA fine-tuned model

Quick answer
To evaluate a LoRA fine-tuned model, load the base model, apply the LoRA adapter with the PEFT library, and run it on a validation dataset to measure metrics like accuracy, perplexity, or generation quality. Using the same evaluation script with and without the adapter lets you compare performance before and after fine-tuning.

PREREQUISITES

  • Python 3.8+
  • pip install torch transformers peft datasets
  • Access to a LoRA fine-tuned model checkpoint

Setup

Install required Python packages for model loading, LoRA integration, and dataset handling.

bash
pip install torch transformers peft datasets

Step by step

Load the base model and attach the LoRA adapter with PEFT, then run evaluation on a validation dataset to compute metrics like perplexity or accuracy. Padding tokens are masked out of the loss below so they do not skew the perplexity.

python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# Load base model and tokenizer
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16, device_map="auto")

# Load LoRA fine-tuned weights
lora_model_path = os.environ.get("LORA_MODEL_PATH")  # e.g. "./lora-finetuned"
model = PeftModel.from_pretrained(model, lora_model_path)
model.eval()

# Load validation dataset (e.g., wikitext for perplexity)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")

# Tokenize a manageable slice of non-empty texts (wikitext contains blank
# lines, and the full split in one padded batch would exhaust GPU memory)
texts = [t for t in dataset["text"] if t.strip()][:64]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Mask padding positions so they do not contribute to the loss
labels = inputs["input_ids"].clone()
labels[inputs["attention_mask"] == 0] = -100

# Evaluate perplexity
with torch.no_grad():
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    perplexity = torch.exp(loss)

print(f"Perplexity of LoRA fine-tuned model: {perplexity.item():.2f}")
output
Perplexity of LoRA fine-tuned model: 12.34
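The single-batch example above will not scale to a large validation set. A hedged sketch of batch-wise aggregation, assuming you run the same masked-label forward pass once per chunk and collect a (mean loss, token count) pair from each batch; weighting by token count keeps the corpus perplexity unbiased when the last batch is shorter:

```python
import math

def iter_batches(texts, batch_size):
    """Yield successive slices of the text list."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

def corpus_perplexity(batch_stats):
    """Combine per-batch (mean_loss, n_tokens) pairs into one perplexity.

    Token-weighted averaging, so a short final batch is not over-counted.
    """
    total_nll = sum(loss * n for loss, n in batch_stats)
    total_tokens = sum(n for _, n in batch_stats)
    return math.exp(total_nll / total_tokens)

# Synthetic check: batches sharing a mean loss of ln(20) give perplexity 20
stats = [(math.log(20.0), 128), (math.log(20.0), 64)]
print(round(corpus_perplexity(stats), 2))  # 20.0
```

In the evaluation loop, tokenize each chunk from `iter_batches(texts, 8)`, run the forward pass from the step above, and append `(outputs.loss.item(), int(inputs["attention_mask"].sum()))` to the list (the token count is an approximation, since the loss is averaged over shifted label positions).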

Common variations

  • Swap in task-specific datasets, e.g. load_dataset("squad"), to measure QA accuracy.
  • Evaluate generation quality by sampling outputs and comparing with references.
  • Run evaluation asynchronously or with batch processing for large datasets.
  • Use quantized or 4-bit models with LoRA for faster inference.
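For the generation-quality variation, a common lightweight metric is token-level F1 between sampled outputs and reference answers, in the style of the SQuAD evaluation script. A minimal sketch (plain whitespace tokenization is an assumption here; real setups also normalize punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(freq) times
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat", "the cat sat on the mat"), 3))  # 0.667
```

Average token_f1 over all (prediction, reference) pairs, and compare the average with the adapter applied against the base model's score.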

Troubleshooting

  • If you see CUDA out-of-memory errors, reduce batch size or use model quantization.
  • If LoRA weights fail to load, verify that LORA_MODEL_PATH points to the correct checkpoint directory (it should contain adapter_config.json and the adapter weights).
  • Ensure tokenizer and base model versions match to avoid tokenization errors.
  • Check that torch_dtype and device mapping are compatible with your hardware.

Key Takeaways

  • Load the base model and apply LoRA weights with the PEFT library for evaluation.
  • Use standard NLP metrics like perplexity or accuracy on a validation dataset to measure performance.
  • Match tokenizer and model versions to avoid tokenization mismatches during evaluation.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct