How to · Intermediate · 3 min read

What metrics to use for fine-tuned model evaluation

Quick answer
Choose metrics by task. For classification, accuracy and F1 score are standard; for language generation, perplexity measures fluency, while BLEU and ROUGE measure overlap with reference text.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to access fine-tuned models and evaluate them.

```bash
pip install "openai>=1.0"
export OPENAI_API_KEY="sk-..."  # your API key
```

Step by step

Evaluate a fine-tuned classification model using accuracy and F1 score with Python's sklearn library. For language models, compute perplexity from log-likelihoods. Here's a complete example for classification evaluation.

```python
import os
from openai import OpenAI
from sklearn.metrics import accuracy_score, f1_score

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample test data
texts = ["Example input 1", "Example input 2"]
true_labels = [0, 1]  # ground-truth labels

# Call the fine-tuned model for predictions
pred_labels = []
for text in texts:
    response = client.chat.completions.create(
        model="gpt-4o",  # replace with your fine-tuned model ID (e.g. "ft:gpt-4o-...")
        messages=[{"role": "user", "content": text}],
        temperature=0,  # deterministic outputs for evaluation
    )
    pred_label = int(response.choices[0].message.content.strip())
    pred_labels.append(pred_label)

# Calculate metrics
acc = accuracy_score(true_labels, pred_labels)
f1 = f1_score(true_labels, pred_labels)

print(f"Accuracy: {acc:.2f}")
print(f"F1 Score: {f1:.2f}")
```

Output:

```text
Accuracy: 1.00
F1 Score: 1.00
```

Common variations

For language generation tasks, use perplexity calculated from the model's log probabilities to measure fluency. Use BLEU or ROUGE scores for translation or summarization quality. Async calls and streaming responses are supported in the OpenAI SDK for real-time evaluation.
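The perplexity calculation itself needs no API call: given per-token log probabilities (the OpenAI SDK returns these when you pass `logprobs=True`, as `.logprob` fields on `response.choices[0].logprobs.content`), perplexity is the exponentiated average negative log-likelihood. A minimal sketch, with illustrative log-prob values:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Example: natural-log probabilities for a 4-token completion
logprobs = [-0.1, -0.5, -0.2, -0.4]
print(f"Perplexity: {perplexity(logprobs):.3f}")
```

Lower is better: a perplexity near 1 means the model assigned high probability to every token it generated.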

| Metric | Use case | Description |
| --- | --- | --- |
| Accuracy | Classification | Percentage of correct predictions |
| F1 Score | Classification | Harmonic mean of precision and recall |
| Perplexity | Language modeling | Exponentiated average negative log-likelihood |
| BLEU | Translation | N-gram overlap between generated and reference text |
| ROUGE | Summarization | Recall-oriented n-gram overlap |
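To make the BLEU row concrete, here is a simplified sentence-level BLEU: the geometric mean of modified n-gram precisions (up to bigrams here) times a brevity penalty. Real BLEU uses up to 4-grams, multiple references, and smoothing, so for reported scores use a library such as sacrebleu; this sketch only illustrates the mechanics:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified sentence-level BLEU for a single reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Modified precision: clip each n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

A perfect match scores 1.0; a correct but truncated candidate is pulled down by the brevity penalty rather than by precision.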

Troubleshooting

If predictions are not in the expected format, check your output-parsing logic. If metric scores are low, check data quality and make sure the fine-tuned model matches the evaluation task. If API calls fail, confirm that your API key and model ID are correct.
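For the parsing issue, one defensive pattern is to extract the label with a regex and fall back to a sentinel value, so a single malformed response does not crash the whole evaluation run. `parse_label` below is a hypothetical helper, not part of any SDK:

```python
import re

def parse_label(raw: str, default: int = -1) -> int:
    """Extract the first integer label from a model response.

    Returns `default` when no integer is found, letting the caller
    count or filter unparseable responses instead of raising."""
    match = re.search(r"-?\d+", raw)
    return int(match.group()) if match else default

print(parse_label("1"))             # 1
print(parse_label("Label: 0"))      # 0
print(parse_label("I'm not sure"))  # -1
```

Filtering out the sentinel before scoring (or counting it as an error) keeps the reported metrics honest about how often the model failed to answer in the requested format.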

Key Takeaways

  • Use task-appropriate metrics: accuracy and F1 for classification, perplexity and BLEU for generation.
  • Calculate perplexity from model log-likelihoods to assess language model fluency.
  • Parse model outputs carefully to extract predicted labels or generated text for metric computation.
Verified 2026-04 · gpt-4o