How to evaluate a fine-tuned model
Measure accuracy, loss, F1-score, or domain-specific benchmarks on a held-out validation or test dataset. You can also perform qualitative checks by generating outputs for representative prompts and comparing them to expected results.

Prerequisites

- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable to access the OpenAI API for fine-tuned model evaluation.
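The key can be exported once per shell session so the SDK picks it up automatically (a sketch; replace the placeholder with your actual key):

```shell
# Make the key available to the OpenAI SDK for this shell session
export OPENAI_API_KEY="your-api-key"
```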
pip install openai>=1.0

Step by step
Use the OpenAI API to send evaluation prompts to your fine-tuned model and compare the responses against expected outputs. Calculate metrics like accuracy or F1-score on the test set.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example test dataset: list of (prompt, expected_response)
test_data = [
    ("Translate 'hello' to French", "bonjour"),
    ("Summarize: AI is transforming industries.", "AI is changing many fields."),
]

correct = 0
for prompt, expected in test_data:
    response = client.chat.completions.create(
        model="ft-your-fine-tuned-model",  # replace with your fine-tuned model ID
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content.strip().lower()
    if output == expected.lower():
        correct += 1

accuracy = correct / len(test_data)
print(f"Accuracy on test set: {accuracy:.2f}")

Accuracy on test set: 1.00
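Note that exact string comparison is fragile: a difference in capitalization, punctuation, or trailing whitespace will count a correct answer as wrong. A light normalization step before comparing helps; the `normalize` helper below is a sketch, not part of the OpenAI SDK:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for fairer matching."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)


# Outputs that differ only in formatting now compare as equal
print(normalize("  Bonjour! ") == normalize("bonjour"))  # → True
```

You would then compare `normalize(output) == normalize(expected)` in the loop above instead of raw strings.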
Common variations
You can evaluate asynchronously, use streaming for large outputs, or switch models by changing the model parameter. For more complex tasks, use libraries like sklearn to compute precision, recall, and F1-score.
import asyncio
import os

from openai import AsyncOpenAI

# Async evaluation uses the AsyncOpenAI client; the method name is the same,
# you just await the call. (chat.completions.acreate does not exist in SDK v1.)
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])


async def evaluate_async():
    test_data = [("Explain AI", "AI is artificial intelligence.")]
    correct = 0
    for prompt, expected in test_data:
        response = await client.chat.completions.create(
            model="ft-your-fine-tuned-model",  # replace with your fine-tuned model ID
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.choices[0].message.content.strip().lower()
        if output == expected.lower():
            correct += 1
    accuracy = correct / len(test_data)
    print(f"Async accuracy: {accuracy:.2f}")


asyncio.run(evaluate_async())

Async accuracy: 1.00
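For free-form answers, exact match undercounts partially correct outputs. If you'd rather not add scikit-learn as a dependency, a token-overlap F1 score (the style of metric used by QA benchmarks such as SQuAD) can be computed in plain Python. A minimal sketch:

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# 4 of 5 tokens overlap, so precision = recall = 0.8
print(token_f1("ai is changing many fields", "ai is transforming many fields"))  # → 0.8
```

Averaging `token_f1` over the test set gives a softer score than exact-match accuracy, which is often more informative for summarization- or explanation-style outputs.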
Troubleshooting
If your evaluation accuracy is unexpectedly low, verify that your test prompts match the fine-tuning domain and that the expected outputs are correctly formatted. Also, ensure you are calling the correct fine-tuned model ID and that your API key has access.
Key Takeaways
- Use a held-out test set with known expected outputs to quantitatively evaluate your fine-tuned model.
- Calculate metrics like accuracy or F1-score to measure performance objectively.
- Test with real-world prompts to qualitatively assess model behavior and domain fit.
- Use the OpenAI SDK v1+ with environment variables for secure, reproducible evaluation.
- If results are poor, check prompt formatting, model ID, and API access permissions.