How-to · Beginner to intermediate · 3 min read

How to evaluate a fine-tuned model with Hugging Face

Quick answer
Use the Hugging Face transformers and datasets libraries to load your fine-tuned model and evaluation dataset, then run the model on the test set and compute metrics like accuracy or F1 with evaluate. This approach provides a straightforward way to assess your model's performance programmatically.

Prerequisites

  • Python 3.8+
  • pip install transformers datasets evaluate
  • Hugging Face API token if loading private models

Setup

Install the necessary libraries to load and evaluate your fine-tuned Hugging Face model.

bash
pip install transformers datasets evaluate

Step by step

Load your fine-tuned model and tokenizer, prepare the evaluation dataset, run predictions, and compute metrics like accuracy or F1 score.

python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
import evaluate
import torch
import os

# Load fine-tuned model and tokenizer
model_name = os.environ.get("HF_FINE_TUNED_MODEL", "your-fine-tuned-model")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load evaluation dataset
# Example: Using the 'glue' dataset's 'mrpc' validation split
raw_datasets = load_dataset("glue", "mrpc")
eval_dataset = raw_datasets["validation"]

# Tokenize the dataset

def preprocess_function(examples):
    # padding=True pads to the longest sequence in each mapped batch
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding=True)

tokenized_eval = eval_dataset.map(preprocess_function, batched=True)

# Prepare metric
metric = evaluate.load("glue", "mrpc")

# Run evaluation (one example at a time for simplicity; see
# "Common variations" for ways to speed this up)
model.eval()

all_predictions = []
all_labels = []

for example in tokenized_eval:
    # Build a batch of size 1; include token_type_ids when the tokenizer
    # produces them (BERT-style models need them for sentence pairs)
    inputs = {
        k: torch.tensor([example[k]])
        for k in ("input_ids", "attention_mask", "token_type_ids")
        if k in example
    }
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1).cpu().numpy()
    all_predictions.extend(predictions)
    all_labels.append(example["label"])

# Compute metrics
results = metric.compute(predictions=all_predictions, references=all_labels)
print("Evaluation results:", results)
output
Evaluation results: {'accuracy': 0.85, 'f1': 0.90}
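If the numbers from metric.compute look suspicious, it can help to recompute them by hand. The sketch below implements the standard accuracy and binary-F1 definitions in plain Python; the prediction and label lists are made up purely for illustration.

```python
def accuracy(preds, refs):
    # Fraction of positions where the prediction equals the reference
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def binary_f1(preds, refs, positive=1):
    # F1 = 2 * precision * recall / (precision + recall)
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data for illustration only
preds = [1, 0, 1, 1, 0, 1]
refs  = [1, 0, 0, 1, 0, 1]
print(accuracy(preds, refs))   # one mismatch out of six -> 5/6
print(binary_f1(preds, refs))
```

Comparing these hand-rolled values against metric.compute on the same lists is a quick sanity check that your predictions and labels are in the order you think they are.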

Common variations

You can evaluate on different datasets by changing the load_dataset arguments, speed things up by moving the model and input tensors to a GPU, or evaluate generation models by computing BLEU or ROUGE metrics instead.

python
import torch

# Move the model to GPU if available (input tensors must be moved too)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Example for generation-model evaluation (e.g., summarization);
# model_name must point to a seq2seq checkpoint here
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Use evaluate library for ROUGE
rouge = evaluate.load('rouge')

# Generate predictions and compute ROUGE
# ... (depends on task and dataset)

Troubleshooting

  • If you get CUDA out of memory errors, reduce batch size or run on CPU.
  • If tokenizer or model loading fails, verify the model name and your Hugging Face token permissions.
  • If metrics seem off, ensure labels and predictions align correctly in shape and order.
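For the out-of-memory case, one option that avoids the full Trainer machinery is slicing the padded input tensors into smaller mini-batches yourself with torch.split. The sketch below uses made-up tensors and a stand-in for the model call, so it runs without downloading anything; substitute your own tokenized inputs and forward pass.

```python
import torch

# Pretend these are the padded inputs for the whole eval set
input_ids = torch.randint(0, 1000, (32, 16))
attention_mask = torch.ones_like(input_ids)

batch_size = 8  # lower this further if you still hit CUDA OOM
all_logits = []
for ids, mask in zip(torch.split(input_ids, batch_size),
                     torch.split(attention_mask, batch_size)):
    # Real usage:
    #   with torch.no_grad():
    #       outputs = model(input_ids=ids, attention_mask=mask)
    #   all_logits.append(outputs.logits)
    all_logits.append(ids.float().mean(dim=-1, keepdim=True))  # stand-in output

logits = torch.cat(all_logits)
print(logits.shape)  # one row per example in the eval set
```

Only one mini-batch occupies GPU memory at a time, so peak usage scales with batch_size rather than with the size of the whole evaluation set.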

Key takeaways

  • Use Hugging Face Transformers and Datasets libraries to load and preprocess evaluation data.
  • Compute standard metrics with the evaluate library for reliable performance measurement.
  • Adjust device usage and batch size to optimize evaluation speed and memory.
  • Verify model and tokenizer compatibility to avoid loading errors.
  • Ensure predictions and labels are correctly aligned before metric computation.
Verified 2026-04 · AutoModelForSequenceClassification, AutoTokenizer