How to evaluate a fine-tuned model
Measure accuracy, loss, F1-score, or domain-specific benchmarks on a held-out validation or test dataset. You can also perform qualitative checks by generating outputs for representative prompts and comparing them to expected results.

Prerequisites

- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable to access the OpenAI API for fine-tuned model evaluation.
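The key can be exported once per shell session so the SDK picks it up automatically (a sketch; replace the placeholder with your actual key):

```shell
# Make the key available to the OpenAI SDK for this shell session
export OPENAI_API_KEY="your-api-key"
```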
pip install openai>=1.0

Step by step
Use the OpenAI API to send evaluation prompts to your fine-tuned model and compare the responses against expected outputs. Calculate metrics like accuracy or F1-score on the test set.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example test dataset: list of (prompt, expected_response)
test_data = [
    ("Translate 'hello' to French", "bonjour"),
    ("Summarize: AI is transforming industries.", "AI is changing many fields."),
]

correct = 0
for prompt, expected in test_data:
    response = client.chat.completions.create(
        model="ft-your-fine-tuned-model",  # replace with your fine-tuned model ID
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content.strip().lower()
    if output == expected.lower():
        correct += 1

accuracy = correct / len(test_data)
print(f"Accuracy on test set: {accuracy:.2f}")

Accuracy on test set: 1.00
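Note that exact string comparison is fragile: a difference in capitalization, punctuation, or trailing whitespace will count a correct answer as wrong. A light normalization step before comparing helps; the `normalize` helper below is a sketch, not part of the OpenAI SDK:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for fairer matching."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)


# Outputs that differ only in formatting now compare as equal
print(normalize("  Bonjour! ") == normalize("bonjour"))  # → True
```

You would then compare `normalize(output) == normalize(expected)` in the loop above instead of raw strings.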
Common variations
You can evaluate asynchronously, use streaming for large outputs, or switch models by changing the model parameter. For more complex tasks, use libraries like sklearn to compute precision, recall, and F1-score.
import asyncio
import os

from openai import AsyncOpenAI

# Async evaluation uses the AsyncOpenAI client; the method name is the same,
# you just await the call. (chat.completions.acreate does not exist in SDK v1.)
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])


async def evaluate_async():
    test_data = [("Explain AI", "AI is artificial intelligence.")]
    correct = 0
    for prompt, expected in test_data:
        response = await client.chat.completions.create(
            model="ft-your-fine-tuned-model",  # replace with your fine-tuned model ID
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.choices[0].message.content.strip().lower()
        if output == expected.lower():
            correct += 1
    accuracy = correct / len(test_data)
    print(f"Async accuracy: {accuracy:.2f}")


asyncio.run(evaluate_async())

Async accuracy: 1.00
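For free-form answers, exact match undercounts partially correct outputs. If you'd rather not add scikit-learn as a dependency, a token-overlap F1 score (the style of metric used by QA benchmarks such as SQuAD) can be computed in plain Python. A minimal sketch:

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# 4 of 5 tokens overlap, so precision = recall = 0.8
print(token_f1("ai is changing many fields", "ai is transforming many fields"))  # → 0.8
```

Averaging `token_f1` over the test set gives a softer score than exact-match accuracy, which is often more informative for summarization- or explanation-style outputs.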
Troubleshooting
If your evaluation accuracy is unexpectedly low, verify that your test prompts match the fine-tuning domain and that the expected outputs are correctly formatted. Also, ensure you are calling the correct fine-tuned model ID and that your API key has access.
Key Takeaways
- Use a held-out test set with known expected outputs to quantitatively evaluate your fine-tuned model.
- Calculate metrics like accuracy or F1-score to measure performance objectively.
- Test with real-world prompts to qualitatively assess model behavior and domain fit.
- Use the OpenAI SDK v1+ with environment variables for secure, reproducible evaluation.
- If results are poor, check prompt formatting, model ID, and API access permissions.