How-to · Beginner to intermediate · 3 min read

How to evaluate a fine-tuned model with Hugging Face

Quick answer
Use the Hugging Face transformers and datasets libraries to load your fine-tuned model and evaluation dataset, then run the model on the test set and compute metrics like accuracy or F1 with evaluate. This approach provides a straightforward way to assess your model's performance programmatically.

Prerequisites

  • Python 3.8+
  • pip install transformers datasets evaluate
  • Hugging Face API token if loading private models

Setup

Install the necessary libraries to load and evaluate your fine-tuned Hugging Face model.

bash
pip install transformers datasets evaluate

Step by step

Load your fine-tuned model and tokenizer, prepare the evaluation dataset, run predictions, and compute metrics like accuracy or F1 score.

python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
import evaluate
import torch
import os

# Load fine-tuned model and tokenizer
model_name = os.environ.get("HF_FINE_TUNED_MODEL", "your-fine-tuned-model")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load evaluation dataset
# Example: Using the 'glue' dataset's 'mrpc' validation split
raw_datasets = load_dataset("glue", "mrpc")
eval_dataset = raw_datasets["validation"]

# Tokenize the dataset

def preprocess_function(examples):
    # padding=True pads to the longest sequence in each mapped batch
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding=True)

tokenized_eval = eval_dataset.map(preprocess_function, batched=True)

# Prepare metric
metric = evaluate.load("glue", "mrpc")

# Run evaluation (one example at a time for simplicity; see
# "Common variations" for ways to speed this up)
model.eval()

all_predictions = []
all_labels = []

for example in tokenized_eval:
    # Build a batch of size 1; include token_type_ids when the tokenizer
    # produces them (BERT-style models need them for sentence pairs)
    inputs = {
        k: torch.tensor([example[k]])
        for k in ("input_ids", "attention_mask", "token_type_ids")
        if k in example
    }
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1).cpu().numpy()
    all_predictions.extend(predictions)
    all_labels.append(example["label"])

# Compute metrics
results = metric.compute(predictions=all_predictions, references=all_labels)
print("Evaluation results:", results)
output
Evaluation results: {'accuracy': 0.85, 'f1': 0.90}
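If the numbers from metric.compute look suspicious, it can help to recompute them by hand. The sketch below implements the standard accuracy and binary-F1 definitions in plain Python; the prediction and label lists are made up purely for illustration.

```python
def accuracy(preds, refs):
    # Fraction of positions where the prediction equals the reference
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def binary_f1(preds, refs, positive=1):
    # F1 = 2 * precision * recall / (precision + recall)
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data for illustration only
preds = [1, 0, 1, 1, 0, 1]
refs  = [1, 0, 0, 1, 0, 1]
print(accuracy(preds, refs))   # one mismatch out of six -> 5/6
print(binary_f1(preds, refs))
```

Comparing these hand-rolled values against metric.compute on the same lists is a quick sanity check that your predictions and labels are in the order you think they are.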

Common variations

You can evaluate on different datasets by changing the load_dataset arguments, speed things up by moving the model and input tensors to a GPU, or evaluate generation models by computing BLEU or ROUGE metrics instead.

python
import torch

# Move the model to GPU if available (input tensors must be moved too)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Example for generation-model evaluation (e.g., summarization);
# model_name must point to a seq2seq checkpoint here
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Use evaluate library for ROUGE
rouge = evaluate.load('rouge')

# Generate predictions and compute ROUGE
# ... (depends on task and dataset)

Troubleshooting

  • If you get CUDA out of memory errors, reduce batch size or run on CPU.
  • If tokenizer or model loading fails, verify the model name and your Hugging Face token permissions.
  • If metrics seem off, ensure labels and predictions align correctly in shape and order.
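For the out-of-memory case, one option that avoids the full Trainer machinery is slicing the padded input tensors into smaller mini-batches yourself with torch.split. The sketch below uses made-up tensors and a stand-in for the model call, so it runs without downloading anything; substitute your own tokenized inputs and forward pass.

```python
import torch

# Pretend these are the padded inputs for the whole eval set
input_ids = torch.randint(0, 1000, (32, 16))
attention_mask = torch.ones_like(input_ids)

batch_size = 8  # lower this further if you still hit CUDA OOM
all_logits = []
for ids, mask in zip(torch.split(input_ids, batch_size),
                     torch.split(attention_mask, batch_size)):
    # Real usage:
    #   with torch.no_grad():
    #       outputs = model(input_ids=ids, attention_mask=mask)
    #   all_logits.append(outputs.logits)
    all_logits.append(ids.float().mean(dim=-1, keepdim=True))  # stand-in output

logits = torch.cat(all_logits)
print(logits.shape)  # one row per example in the eval set
```

Only one mini-batch occupies GPU memory at a time, so peak usage scales with batch_size rather than with the size of the whole evaluation set.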

Key takeaways

  • Use Hugging Face Transformers and Datasets libraries to load and preprocess evaluation data.
  • Compute standard metrics with the evaluate library for reliable performance measurement.
  • Adjust device usage and batch size to optimize evaluation speed and memory.
  • Verify model and tokenizer compatibility to avoid loading errors.
  • Ensure predictions and labels are correctly aligned before metric computation.
Verified 2026-04 · AutoModelForSequenceClassification, AutoTokenizer