How-to · Intermediate · 3 min read

How to evaluate LLM performance

Quick answer
Evaluate LLM performance using metrics like perplexity for language modeling quality, accuracy or F1 for classification tasks, and human evaluation for fluency and relevance. Automated benchmarks combined with real user feedback provide a comprehensive assessment.
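The quick answer mentions F1 for classification tasks; as a reminder of what that metric computes, here is a minimal pure-Python sketch (the predictions and gold labels are made-up illustrations, not real model output):

```python
def f1_score(predictions, labels, positive="positive"):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(p == positive and l == positive for p, l in zip(predictions, labels))
    fp = sum(p == positive and l != positive for p, l in zip(predictions, labels))
    fn = sum(p != positive and l == positive for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical model outputs vs. gold labels for a sentiment task
preds = ["positive", "negative", "positive", "positive"]
golds = ["positive", "negative", "negative", "positive"]
print(f"F1: {f1_score(preds, golds):.2f}")  # F1: 0.80
```

For multi-class tasks you would average per-class F1 (macro or weighted); libraries like scikit-learn provide this out of the box.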

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the specifier so the shell doesn't interpret >)

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-key-here"

Step by step

This example computes perplexity for the gpt-4o model's own completion: it requests per-token log probabilities, then takes the exponential of the average negative log-likelihood over the generated tokens. It also demonstrates a simple exact-match accuracy check on a single example.

python
import os
from openai import OpenAI
import math

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example prompt and expected completion
prompt = "Translate English to French: 'Hello, how are you?'"
expected = "Bonjour, comment ça va ?"

# Get log probabilities from the model
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=20,
    logprobs=True  # boolean in the v1 chat API; use top_logprobs for alternatives
)

# Extract the log probability of each generated token
logprobs = [token.logprob for token in response.choices[0].logprobs.content]

# Calculate perplexity
avg_neg_logprob = -sum(logprobs) / len(logprobs)
perplexity = math.exp(avg_neg_logprob)
print(f"Perplexity: {perplexity:.2f}")

# Exact-match accuracy on a single example
model_answer = response.choices[0].message.content.strip()
accuracy = 1 if model_answer == expected else 0
print(f"Accuracy: {accuracy}")
output
Perplexity: 12.34
Accuracy: 0
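Exact string match on one example is brittle: the model's "Bonjour, comment ça va ?" can differ from the reference by casing or punctuation alone. A common mitigation is to normalize both strings and average over a test set. The normalization rules below are illustrative assumptions, not a standard:

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    cleaned = text.casefold().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching their reference after normalization."""
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)

# Hypothetical model outputs vs. reference translations
preds = ["Bonjour, comment ça va ?", "bonjour  ", "Salut !"]
refs = ["Bonjour, comment ça va ?", "Bonjour", "Bonjour"]
print(f"Accuracy: {exact_match_accuracy(preds, refs):.2f}")  # Accuracy: 0.67
```

For free-form tasks like translation, overlap metrics such as BLEU or chrF are usually more informative than exact match.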

Common variations

You can evaluate LLMs asynchronously or with streaming for large datasets. Swap in other models where they fit the task, such as claude-3-5-sonnet-20241022 for coding benchmarks or gemini-1.5-pro for multimodal tasks. Human evaluation remains essential for qualitative metrics like fluency and relevance.

python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def evaluate_async():
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize the following text."}],
        max_tokens=50
    )
    print(response.choices[0].message.content)

asyncio.run(evaluate_async())
output
This is a summary of the provided text...
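Streaming returns the completion incrementally, which helps when evaluating long outputs. With the real SDK you would pass stream=True to client.chat.completions.create and iterate the chunks; the sketch below runs the same aggregation loop against a stub shaped like the SDK's chunks (chunk.choices[0].delta.content), so it is runnable without an API key:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Aggregate streamed delta contents into the full completion text."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            parts.append(delta)
    return "".join(parts)

def fake_stream(pieces):
    """Stub mimicking chat-completion chunks; swap in
    client.chat.completions.create(..., stream=True) for real use."""
    for piece in pieces:
        yield SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=piece))])

text = collect_stream(fake_stream(["This is ", "a summary", "."]))
print(text)  # This is a summary.
```

The same loop works unchanged on a real streamed response, since the stub mirrors the chunk attribute layout.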

Troubleshooting

If you see unusually high perplexity values, check that logprobs are correctly extracted and that the prompt is well-formed. For low accuracy, verify expected outputs and consider using multiple test examples. API rate limits or key errors require checking environment variables and usage quotas.
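For rate-limit errors specifically, the v1 SDK client accepts a max_retries argument, but a hand-rolled wrapper makes the backoff policy explicit when looping over a large test set. This is a generic sketch; the delay schedule is an arbitrary choice, not an OpenAI recommendation:

```python
import time

def backoff_delays(retries=5, base=1.0, cap=30.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]

def call_with_retry(fn, retries=5, sleep=time.sleep):
    """Call fn(); on any exception, wait and retry, re-raising after the final attempt."""
    last_err = None
    for attempt, delay in enumerate(backoff_delays(retries)):
        try:
            return fn()
        except Exception as err:
            last_err = err
            if attempt < retries - 1:
                sleep(delay)
    raise last_err

print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

In practice you would wrap each API call, e.g. call_with_retry(lambda: client.chat.completions.create(...)).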

Key Takeaways

  • Use perplexity to measure how well an LLM predicts text sequences.
  • Combine automated metrics with human evaluation for best results.
  • Leverage SDK async and streaming features for large-scale evaluation.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022, gemini-1.5-pro