Weights and Biases for LLM evaluation
Quick answer
Use the wandb Python package to track and log large language model (LLM) evaluation metrics such as loss, accuracy, and custom scores. Integrate wandb with your LLM inference code by initializing a project, logging metrics during evaluation, and finishing the run to visualize results in the wandb dashboard.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install wandb
Setup
Install the required packages and configure your environment variables for OPENAI_API_KEY and WANDB_API_KEY. Initialize a wandb project to track your LLM evaluation runs.
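Since both the OpenAI client and wandb read their keys from the environment, it can help to fail fast with a clear message before any network call is made. The helper below is a minimal sketch; the name require_env is illustrative, not part of wandb or the OpenAI SDK.

```python
import os


def require_env(*names):
    """Raise a clear error if any required environment variable is unset."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )


# Example: call this before creating the OpenAI client or the wandb run
# require_env("OPENAI_API_KEY", "WANDB_API_KEY")
```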
pip install openai wandb

Step by step
This example demonstrates how to evaluate an LLM using the OpenAI SDK and log evaluation metrics such as loss and accuracy to wandb. It runs a simple prompt completion and logs the results.
```python
import os

import wandb
from openai import OpenAI

# Initialize a wandb run for this evaluation
wandb.init(project="llm-evaluation", entity="your-entity")  # replace with your own entity

# Initialize the OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define evaluation prompts and expected outputs
prompts = [
    "Translate 'Hello' to French.",
    "What is the capital of Germany?",
    "Summarize the following text: AI is transforming industries.",
]
expected_outputs = ["Bonjour", "Berlin", "AI is changing many industries."]

correct = 0
for i, prompt in enumerate(prompts):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content.strip()

    # Simple accuracy check (exact match)
    is_correct = output.lower() == expected_outputs[i].lower()
    correct += int(is_correct)

    # Log each prediction and its correctness
    wandb.log({
        "prompt": prompt,
        "output": output,
        "expected": expected_outputs[i],
        "correct": is_correct,
    })

# Log overall accuracy
accuracy = correct / len(prompts)
wandb.log({"accuracy": accuracy})
print(f"Evaluation accuracy: {accuracy:.2f}")

# Finish the wandb run
wandb.finish()
```

Output
Evaluation accuracy: 0.67
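Exact string matching is strict: a reply like "Bonjour !" or "The capital of Germany is Berlin." would be scored as wrong. One more forgiving option is a normalized containment check; the helper below is a sketch (is_match is an illustrative name, not part of wandb or the OpenAI SDK), and you could swap it in for the exact-match comparison above.

```python
import string


def is_match(output, expected):
    """Case-insensitive check that the expected answer appears in the
    model output, ignoring surrounding whitespace and punctuation."""
    def normalize(s):
        return s.lower().strip().strip(string.punctuation)
    return normalize(expected) in normalize(output)


print(is_match("Bonjour!", "Bonjour"))                # True
print(is_match("The capital is Berlin.", "Berlin"))   # True
print(is_match("Paris", "Berlin"))                    # False
```

Containment checks still miss paraphrases; for free-form answers like the summarization prompt, a semantic or model-graded score is usually needed.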
Common variations
You can extend wandb integration by logging additional metrics like token-level loss, perplexity, or custom evaluation scores. For async or streaming LLM calls, log metrics incrementally. You can also switch models by changing the model parameter in the OpenAI SDK call.
```python
import asyncio
import os

import wandb
from openai import AsyncOpenAI


async def async_eval():
    wandb.init(project="llm-evaluation-async")
    # In openai>=1.0, async calls use the AsyncOpenAI client with an
    # awaitable .create() (there is no .acreate() method)
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

    prompt = "Explain quantum computing in simple terms."
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content.strip()

    wandb.log({"prompt": prompt, "output": output})
    print(output)
    wandb.finish()


asyncio.run(async_eval())
```

Output
Quantum computing uses quantum bits to perform complex calculations faster than classical computers.
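For the perplexity mentioned above: the chat completions API can return per-token log probabilities when you pass logprobs=True, and perplexity is then the exponential of the negative mean log probability. A minimal sketch, using made-up logprob values so it stands alone (in the SDK the real values come from the response's logprobs field, and the wandb.log call is commented out):

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(logprob)) over the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up per-token log probabilities for illustration
logprobs = [-0.1, -0.5, -0.2]
ppl = perplexity(logprobs)
print(f"perplexity: {ppl:.3f}")
# wandb.log({"perplexity": ppl})
```

Lower perplexity means the model assigned higher probability to the tokens it generated.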
Troubleshooting
- If wandb fails to log, ensure your WANDB_API_KEY environment variable is set and that you are logged in via wandb login.
- If you see API errors from OpenAI, verify that your OPENAI_API_KEY is valid and has sufficient quota.
- For slow logging, batch your wandb.log() calls or use asynchronous logging.
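One way to batch is to accumulate metric dicts locally and merge them into a single logging call (wandb.log also accepts commit=False to coalesce several calls into one step). The class below is an illustrative sketch, not a wandb API; because the buffered dicts are merged, it assumes distinct metric names within a batch (repeated keys would overwrite).

```python
class BatchedLogger:
    """Accumulate metric dicts and flush them as one call to a sink
    callable (e.g. wandb.log), reducing per-step logging overhead."""

    def __init__(self, sink, batch_size=10):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer = []

    def log(self, metrics):
        self.buffer.append(metrics)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            merged = {}
            for metrics in self.buffer:
                merged.update(metrics)
            self.sink(merged)
            self.buffer = []


# With wandb you would pass wandb.log as the sink:
#   logger = BatchedLogger(wandb.log, batch_size=25)
# Demo with a plain list standing in for wandb.log:
calls = []
logger = BatchedLogger(calls.append, batch_size=2)
logger.log({"loss": 0.3})
logger.log({"accuracy": 0.7})  # reaches batch_size, triggers a flush
logger.flush()                 # no-op here: buffer is already empty
print(calls)  # [{'loss': 0.3, 'accuracy': 0.7}]
```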
Key Takeaways
- Use wandb to track LLM evaluation metrics and visualize results in real time.
- Log both individual predictions and aggregate metrics like accuracy for comprehensive analysis.
- Ensure environment variables for API keys are set to avoid authentication errors.