Weights and Biases for LLM evaluation
Quick answer
Use the wandb Python package to track and log large language model (LLM) evaluation metrics such as loss, accuracy, and custom scores. Integrate wandb with your LLM inference code by initializing a project, logging metrics during evaluation, and finishing the run to visualize results in the wandb dashboard.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install wandb
Setup
Install the required packages and configure your environment variables for OPENAI_API_KEY and WANDB_API_KEY. Initialize a wandb project to track your LLM evaluation runs.
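Since both the OpenAI client and wandb read their keys from the environment, it can help to fail fast with a clear message before any network call is made. The helper below is a minimal sketch; the name require_env is illustrative, not part of wandb or the OpenAI SDK.

```python
import os


def require_env(*names):
    """Raise a clear error if any required environment variable is unset."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )


# Example: call this before creating the OpenAI client or the wandb run
# require_env("OPENAI_API_KEY", "WANDB_API_KEY")
```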
pip install openai wandb

Step by step
This example demonstrates how to evaluate an LLM using the OpenAI SDK and log evaluation metrics such as loss and accuracy to wandb. It runs a simple prompt completion and logs the results.
```python
import os

import wandb
from openai import OpenAI

# Initialize a wandb run for this evaluation
wandb.init(project="llm-evaluation", entity="your-entity")  # replace with your own entity

# Initialize the OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define evaluation prompts and expected outputs
prompts = [
    "Translate 'Hello' to French.",
    "What is the capital of Germany?",
    "Summarize the following text: AI is transforming industries.",
]
expected_outputs = ["Bonjour", "Berlin", "AI is changing many industries."]

correct = 0
for i, prompt in enumerate(prompts):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content.strip()

    # Simple accuracy check (exact match)
    is_correct = output.lower() == expected_outputs[i].lower()
    correct += int(is_correct)

    # Log each prediction and its correctness
    wandb.log({
        "prompt": prompt,
        "output": output,
        "expected": expected_outputs[i],
        "correct": is_correct,
    })

# Log overall accuracy
accuracy = correct / len(prompts)
wandb.log({"accuracy": accuracy})
print(f"Evaluation accuracy: {accuracy:.2f}")

# Finish the wandb run
wandb.finish()
```

Output
Evaluation accuracy: 0.67
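Exact string matching is strict: a reply like "Bonjour !" or "The capital of Germany is Berlin." would be scored as wrong. One more forgiving option is a normalized containment check; the helper below is a sketch (is_match is an illustrative name, not part of wandb or the OpenAI SDK), and you could swap it in for the exact-match comparison above.

```python
import string


def is_match(output, expected):
    """Case-insensitive check that the expected answer appears in the
    model output, ignoring surrounding whitespace and punctuation."""
    def normalize(s):
        return s.lower().strip().strip(string.punctuation)
    return normalize(expected) in normalize(output)


print(is_match("Bonjour!", "Bonjour"))                # True
print(is_match("The capital is Berlin.", "Berlin"))   # True
print(is_match("Paris", "Berlin"))                    # False
```

Containment checks still miss paraphrases; for free-form answers like the summarization prompt, a semantic or model-graded score is usually needed.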
Common variations
You can extend wandb integration by logging additional metrics like token-level loss, perplexity, or custom evaluation scores. For async or streaming LLM calls, log metrics incrementally. You can also switch models by changing the model parameter in the OpenAI SDK call.
```python
import asyncio
import os

import wandb
from openai import AsyncOpenAI


async def async_eval():
    wandb.init(project="llm-evaluation-async")
    # In openai>=1.0, async calls use the AsyncOpenAI client with an
    # awaitable .create() (there is no .acreate() method)
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

    prompt = "Explain quantum computing in simple terms."
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content.strip()

    wandb.log({"prompt": prompt, "output": output})
    print(output)
    wandb.finish()


asyncio.run(async_eval())
```

Output
Quantum computing uses quantum bits to perform complex calculations faster than classical computers.
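For the perplexity mentioned above: the chat completions API can return per-token log probabilities when you pass logprobs=True, and perplexity is then the exponential of the negative mean log probability. A minimal sketch, using made-up logprob values so it stands alone (in the SDK the real values come from the response's logprobs field, and the wandb.log call is commented out):

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(logprob)) over the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up per-token log probabilities for illustration
logprobs = [-0.1, -0.5, -0.2]
ppl = perplexity(logprobs)
print(f"perplexity: {ppl:.3f}")
# wandb.log({"perplexity": ppl})
```

Lower perplexity means the model assigned higher probability to the tokens it generated.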
Troubleshooting
- If wandb fails to log, ensure your WANDB_API_KEY environment variable is set and that you are logged in via wandb login.
- If you see API errors from OpenAI, verify that your OPENAI_API_KEY is valid and has sufficient quota.
- For slow logging, batch your wandb.log() calls or use asynchronous logging.
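One way to batch is to accumulate metric dicts locally and merge them into a single logging call (wandb.log also accepts commit=False to coalesce several calls into one step). The class below is an illustrative sketch, not a wandb API; because the buffered dicts are merged, it assumes distinct metric names within a batch (repeated keys would overwrite).

```python
class BatchedLogger:
    """Accumulate metric dicts and flush them as one call to a sink
    callable (e.g. wandb.log), reducing per-step logging overhead."""

    def __init__(self, sink, batch_size=10):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer = []

    def log(self, metrics):
        self.buffer.append(metrics)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            merged = {}
            for metrics in self.buffer:
                merged.update(metrics)
            self.sink(merged)
            self.buffer = []


# With wandb you would pass wandb.log as the sink:
#   logger = BatchedLogger(wandb.log, batch_size=25)
# Demo with a plain list standing in for wandb.log:
calls = []
logger = BatchedLogger(calls.append, batch_size=2)
logger.log({"loss": 0.3})
logger.log({"accuracy": 0.7})  # reaches batch_size, triggers a flush
logger.flush()                 # no-op here: buffer is already empty
print(calls)  # [{'loss': 0.3, 'accuracy': 0.7}]
```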
Key Takeaways
- Use wandb to track LLM evaluation metrics and visualize results in real time.
- Log both individual predictions and aggregate metrics like accuracy for comprehensive analysis.
- Ensure environment variables for API keys are set to avoid authentication errors.