How to evaluate a fine-tuned LLM
To evaluate a fine-tuned LLM, use quantitative metrics like perplexity and task-specific accuracy on a held-out validation set, combined with qualitative human evaluation for output relevance and coherence. Additionally, benchmark the model on domain-specific tasks or datasets to verify improvements over the base model.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable to interact with your fine-tuned model.
```shell
pip install openai>=1.0
```

Step-by-step evaluation
Evaluate your fine-tuned LLM by running inference on a validation dataset and calculating metrics like perplexity or accuracy. Use human review to assess output quality and alignment.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example validation prompts and expected outputs
validation_data = [
    {"prompt": "Translate to French: 'Hello, how are you?'", "expected": "Bonjour, comment ça va ?"},
    {"prompt": "Summarize: 'The cat sat on the mat.'", "expected": "A cat is sitting on a mat."},
]

correct = 0
for item in validation_data:
    response = client.chat.completions.create(
        model="gpt-4o",  # replace with your fine-tuned model name
        messages=[{"role": "user", "content": item["prompt"]}],
    )
    output = response.choices[0].message.content.strip()
    print(f"Prompt: {item['prompt']}")
    print(f"Output: {output}")
    print(f"Expected: {item['expected']}")
    # Exact string match is a strict criterion; consider fuzzy or
    # semantic matching for free-form outputs.
    if output.lower() == item["expected"].lower():
        correct += 1

accuracy = correct / len(validation_data)
print(f"Validation accuracy: {accuracy:.2f}")
```

Example output:

```
Prompt: Translate to French: 'Hello, how are you?'
Output: Bonjour, comment ça va ?
Expected: Bonjour, comment ça va ?
Prompt: Summarize: 'The cat sat on the mat.'
Output: A cat is sitting on a mat.
Expected: A cat is sitting on a mat.
Validation accuracy: 1.00
```
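Perplexity, mentioned above, is the exponentiated average negative log-likelihood of a token sequence: lower values mean the model assigns higher probability to the held-out text. A minimal sketch in plain Python, assuming you already have per-token log-probabilities (e.g. from the API's `logprobs` option or from a local model; the values below are made up for illustration):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the token sequence."""
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Hypothetical per-token log-probs for a short validation completion
logprobs = [-0.1, -0.5, -0.2, -0.4]
print(f"Perplexity: {perplexity(logprobs):.3f}")  # exp(0.3) ≈ 1.350
```

Compare the average perplexity of the fine-tuned model against the base model on the same held-out set; a meaningful drop suggests the fine-tune actually learned the target distribution.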
Common variations
For large datasets, evaluate asynchronously or stream outputs. For comparison, run the same validation set through other models such as claude-3-5-sonnet-20241022 or llama-3.1-70b via their respective APIs. Also consider task-specific metrics like BLEU for translation or ROUGE for summarization.
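To make the ROUGE idea concrete, here is a unigram-overlap ROUGE-1 F1 score sketched in plain Python. This is a deliberate simplification (whitespace tokenization, no stemming); for real evaluations use an established library such as rouge-score or NLTK:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("a cat is sitting on a mat", "the cat sat on the mat")
print(f"ROUGE-1 F1: {score:.3f}")
```

Averaging such a score over the validation set gives a single number you can track across fine-tuning runs, which is more informative for summarization than exact string matching.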
```python
import asyncio
import os
from openai import AsyncOpenAI

# The async client is required for concurrent requests in openai>=1.0;
# the synchronous client has no acreate method.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def evaluate_async(prompts):
    tasks = [
        client.chat.completions.create(
            model="gpt-4o",  # replace with your fine-tuned model name
            messages=[{"role": "user", "content": prompt}],
        )
        for prompt in prompts
    ]
    responses = await asyncio.gather(*tasks)
    for prompt, response in zip(prompts, responses):
        print(f"Prompt: {prompt}")
        print(f"Output: {response.choices[0].message.content.strip()}")

prompts = ["Explain quantum computing in simple terms.", "Write a poem about spring."]
asyncio.run(evaluate_async(prompts))
```

Example output:

```
Prompt: Explain quantum computing in simple terms.
Output: Quantum computing uses quantum bits that can be in multiple states simultaneously, enabling faster problem solving for certain tasks.
Prompt: Write a poem about spring.
Output: Spring blooms anew, with colors bright and skies so blue...
```
Troubleshooting
If your fine-tuned model outputs irrelevant or low-quality responses, check your training data quality and size. Ensure your validation set is representative and not leaked into training. Also, verify you are calling the correct fine-tuned model name in your API requests.
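One quick sanity check for leakage is to look for exact prompt overlap between the training and validation sets. A minimal sketch (exact matching only; near-duplicates need fuzzy or embedding-based deduplication, which this does not attempt):

```python
def find_leaked_prompts(train_prompts, val_prompts):
    """Return validation prompts that appear verbatim in the training set,
    ignoring case and surrounding whitespace."""
    train_set = {p.strip().lower() for p in train_prompts}
    return [p for p in val_prompts if p.strip().lower() in train_set]

train = ["Translate to French: 'Hello'", "Summarize: 'The cat sat.'"]
val = ["Translate to French: 'Hello'", "Write a haiku about rain."]
leaked = find_leaked_prompts(train, val)
print(f"Leaked prompts: {leaked}")  # remove these from the validation set
```

Any prompts this flags should be dropped from validation before you trust the accuracy numbers, since the model may simply have memorized them.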
Key Takeaways
- Use quantitative metrics like perplexity and accuracy on a held-out validation set to measure fine-tuned LLM performance.
- Complement metrics with qualitative human evaluation to assess output relevance, coherence, and alignment.
- Benchmark your fine-tuned model on domain-specific tasks to verify real-world improvements over the base model.