How-to · beginner · 3 min read

How to evaluate a fine-tuned model

Quick answer
Use the OpenAI Python SDK (v1+) to call your fine-tuned model by passing its ID as the model parameter when creating a chat completion. Evaluate it by sending test prompts and comparing the responses to expected results, either programmatically or manually.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the specifier so the shell doesn't treat > as redirection)

Setup

Install the latest OpenAI Python SDK and set your API key as an environment variable.

bash
pip install "openai>=1.0"
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

Use the OpenAI Python SDK to send test prompts to your fine-tuned model and print the responses for evaluation.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Replace with your fine-tuned model ID (colon-separated format),
# e.g. ft:gpt-4o-mini-2024-07-18:my-org::abc123
fine_tuned_model = "ft:gpt-4o-mini-2024-07-18:my-org::abc123"

# Example test prompts
test_prompts = [
    "Explain the benefits of fine-tuning.",
    "Summarize the following text: OpenAI provides powerful APIs.",
    "What is RAG in AI?"
]

for prompt in test_prompts:
    response = client.chat.completions.create(
        model=fine_tuned_model,
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"Prompt: {prompt}")
    print(f"Response: {response.choices[0].message.content}\n")
output
Prompt: Explain the benefits of fine-tuning.
Response: Fine-tuning allows a base model to specialize on specific tasks or domains, improving accuracy and relevance.

Prompt: Summarize the following text: OpenAI provides powerful APIs.
Response: OpenAI offers APIs that enable developers to integrate advanced AI capabilities into their applications.

Prompt: What is RAG in AI?
Response: RAG stands for Retrieval-Augmented Generation, a technique combining retrieval of documents with generative models for better answers.
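The quick answer mentions comparing responses to expected results programmatically. One minimal approach, shown as a sketch below, is to score each response by the fraction of expected keywords it contains; the keyword lists and scoring rule are illustrative choices, not part of the OpenAI API.

```python
def keyword_score(response: str, expected_keywords: list[str]) -> float:
    """Return the fraction of expected keywords found in the response (case-insensitive)."""
    text = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# Illustrative expected keywords for one of the test prompts above
response_text = (
    "RAG stands for Retrieval-Augmented Generation, a technique combining "
    "retrieval of documents with generative models for better answers."
)
score = keyword_score(response_text, ["retrieval", "generation"])
print(f"Keyword score: {score:.2f}")  # 1.00: both keywords appear
```

Keyword scoring is crude but cheap and deterministic; for subtler quality judgments you would supplement it with manual review or a more sophisticated metric.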

Common variations

You can also evaluate your fine-tuned model asynchronously or with streaming output (which requires the AsyncOpenAI client). Additionally, try different prompt formats or run the same prompts through other OpenAI models for comparison.

python
import asyncio
import os

from openai import AsyncOpenAI

async def async_evaluate():
    # Streaming with await/async for requires the async client (AsyncOpenAI)
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    fine_tuned_model = "ft:gpt-4o-mini-2024-07-18:my-org::abc123"
    prompt = "Describe the process of fine-tuning a model."

    stream = await client.chat.completions.create(
        model=fine_tuned_model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    print("Streaming response:")
    async for chunk in stream:
        # delta.content can be None on some chunks, so fall back to ""
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

asyncio.run(async_evaluate())
output
Streaming response:
Fine-tuning a model involves training a pre-trained base model on your specific dataset to adapt it to your task, improving performance and relevance.
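To compare your fine-tuned model against a base model such as gpt-4o-mini, run the same prompt through both and measure how much the answers overlap. The sketch below uses a simple Jaccard similarity over word sets; the two response strings are placeholders you would fill in from actual API calls, and the metric is an illustrative choice rather than a standard evaluation method.

```python
def jaccard_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two responses."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not (set_a or set_b):
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Placeholder responses; in practice, call client.chat.completions.create
# once with the fine-tuned model and once with the base model.
fine_tuned_answer = "Fine-tuning adapts a base model to your domain."
base_answer = "Fine-tuning trains a base model on your data."

print(f"Overlap: {jaccard_overlap(fine_tuned_answer, base_answer):.2f}")
```

A low overlap is not automatically bad: it may simply mean your fine-tuned model answers in a different style. Use the metric as a signal to inspect, not a verdict.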

Troubleshooting

  • If you get a model not found error, verify that your fine-tuned model ID is correct and that the fine-tuning job completed successfully.
  • If responses are poor, check your training data quality and consider more training epochs.
  • Ensure your API key has permissions to access fine-tuned models.
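For the model-not-found case, a quick local sanity check is to confirm the ID at least matches the usual fine-tuned model naming shape (ft:, then the base model name, then colon-separated organization and suffix segments). The regex below is an assumption about that common shape, not an official validator; the authoritative check is retrieving the model through the API.

```python
import re

# Rough pattern for fine-tuned model IDs, e.g. ft:gpt-4o-mini-2024-07-18:my-org::abc123
# (an assumption about the usual shape, not an official rule)
FT_ID_PATTERN = re.compile(r"^ft:[\w.-]+(:[\w.-]*)*$")

def looks_like_ft_model_id(model_id: str) -> bool:
    """Return True if model_id resembles a fine-tuned model ID."""
    return bool(FT_ID_PATTERN.match(model_id))

print(looks_like_ft_model_id("ft:gpt-4o-mini-2024-07-18:my-org::abc123"))  # True
print(looks_like_ft_model_id("gpt-4o-mini"))  # False: base model, not fine-tuned
```

If the format looks right but the error persists, confirm the model actually exists and is active, for example with client.models.retrieve(model_id) on an authenticated client.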

Key Takeaways

  • Use the OpenAI SDK v1 chat.completions.create method with your fine-tuned model ID to run evaluation prompts.
  • Test multiple prompts and compare outputs to expected answers for thorough evaluation.
  • Async and streaming calls allow real-time evaluation and integration in interactive apps.
  • Verify your fine-tuned model ID and API key permissions if you encounter errors.
Verified 2026-04 · ft:gpt-4o-mini-2024-07-18:my-org::abc123, gpt-4o-mini