How-to · Intermediate · 4 min read

How to evaluate a fine-tuned LLM

Quick answer
To evaluate a fine-tuned LLM, use quantitative metrics like perplexity and task-specific accuracy on a held-out validation set, combined with qualitative human evaluation for output relevance and coherence. Additionally, benchmark the model on domain-specific tasks or datasets to verify improvements over the base model.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python SDK and set your API key as an environment variable to interact with your fine-tuned model.

bash
pip install "openai>=1.0"   # quote the spec so the shell does not treat >= as a redirect
export OPENAI_API_KEY="your-key-here"

Step by step evaluation

Evaluate your fine-tuned LLM by running inference on a validation dataset and calculating metrics like perplexity or accuracy. Use human review to assess output quality and alignment.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example validation prompts and expected outputs
validation_data = [
    {"prompt": "Translate to French: 'Hello, how are you?'", "expected": "Bonjour, comment ça va ?"},
    {"prompt": "Summarize: 'The cat sat on the mat.'", "expected": "A cat is sitting on a mat."}
]

correct = 0
for item in validation_data:
    response = client.chat.completions.create(
        model="gpt-4o",  # replace with your fine-tuned model ID (it starts with "ft:")
        messages=[{"role": "user", "content": item["prompt"]}]
    )
    output = response.choices[0].message.content.strip()
    print(f"Prompt: {item['prompt']}")
    print(f"Output: {output}")
    print(f"Expected: {item['expected']}")
    # Exact string match is a strict criterion; fuzzy or semantic matching is often fairer
    if output.lower() == item["expected"].lower():
        correct += 1

accuracy = correct / len(validation_data)
print(f"Validation accuracy: {accuracy:.2f}")
output
Prompt: Translate to French: 'Hello, how are you?'
Output: Bonjour, comment ça va ?
Expected: Bonjour, comment ça va ?
Prompt: Summarize: 'The cat sat on the mat.'
Output: A cat is sitting on a mat.
Expected: A cat is sitting on a mat.
Validation accuracy: 1.00
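
Perplexity, mentioned in the quick answer, can be estimated from per-token log-probabilities (the chat API can return these via its logprobs option): it is the exponential of the negative mean log-probability. The snippet below shows only the arithmetic, applied to a hypothetical list of log-probs standing in for values you would extract from an API response.

```python
import math

def perplexity(logprobs):
    """Perplexity = exp(-mean(per-token log-probabilities))."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Hypothetical per-token log-probs; in practice, collect these from the
# model's logprobs output for each generated token.
token_logprobs = [-0.1, -0.5, -0.2, -0.05]
print(f"Perplexity: {perplexity(token_logprobs):.3f}")
```

Lower perplexity on a held-out set indicates the fine-tuned model assigns higher probability to the reference text than the base model does.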

Common variations

You can evaluate asynchronously or stream outputs for large datasets. For comparison, run the same validation set through other models such as claude-3-5-sonnet-20241022 or llama-3.1-70b via their respective APIs. Also consider task-specific metrics such as BLEU for translation or ROUGE for summarization.

python
import asyncio
import os
from openai import AsyncOpenAI

# Use AsyncOpenAI: the synchronous client has no awaitable acreate method in openai>=1.0
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def evaluate_async(prompts):
    tasks = [
        client.chat.completions.create(
            model="gpt-4o",  # replace with your fine-tuned model ID
            messages=[{"role": "user", "content": prompt}]
        )
        for prompt in prompts
    ]
    responses = await asyncio.gather(*tasks)
    for prompt, response in zip(prompts, responses):
        print(f"Prompt: {prompt}")
        print(f"Output: {response.choices[0].message.content.strip()}")

prompts = ["Explain quantum computing in simple terms.", "Write a poem about spring."]
asyncio.run(evaluate_async(prompts))
output
Prompt: Explain quantum computing in simple terms.
Output: Quantum computing uses quantum bits that can be in multiple states simultaneously, enabling faster problem solving for certain tasks.
Prompt: Write a poem about spring.
Output: Spring blooms anew, with colors bright and skies so blue...
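
For summarization, the ROUGE metric mentioned above is usually computed with a dedicated library, but a dependency-free sketch of ROUGE-1 (unigram-overlap F1 between a candidate and a reference) captures the idea:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between candidate and reference (a ROUGE-1 sketch)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("a cat is sitting on a mat", "the cat sat on the mat"))  # ≈ 0.462
```

This simplified version ignores stemming and multi-gram variants (ROUGE-2, ROUGE-L); production evaluations should use an established implementation.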

Troubleshooting

If your fine-tuned model outputs irrelevant or low-quality responses, check your training data quality and size. Ensure your validation set is representative and was not leaked into training. Also verify that you are calling the correct fine-tuned model name in your API requests.
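
OpenAI fine-tuned model IDs begin with the prefix ft: (the example ID below is hypothetical), so a small guard in your evaluation script can catch the common mistake of accidentally benchmarking the base model:

```python
def assert_fine_tuned(model_name: str) -> None:
    """Fail fast if the model ID is not a fine-tuned model (OpenAI IDs start with 'ft:')."""
    if not model_name.startswith("ft:"):
        raise ValueError(
            f"'{model_name}' does not look like a fine-tuned model ID; "
            "did you mean to evaluate the base model?"
        )

# Hypothetical fine-tuned model ID for illustration
assert_fine_tuned("ft:gpt-4o-mini-2024-07-18:my-org::abc123")  # passes silently
```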

Key Takeaways

  • Use quantitative metrics like perplexity and accuracy on a held-out validation set to measure fine-tuned LLM performance.
  • Complement metrics with qualitative human evaluation to assess output relevance, coherence, and alignment.
  • Benchmark your fine-tuned model on domain-specific tasks to verify real-world improvements over the base model.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022, llama-3.1-70b