How-to · Beginner · 3 min read

How to run evaluations in LangSmith

Quick answer
Use the langsmith Python SDK to run evaluations programmatically: upload a dataset of examples (inputs plus reference outputs), define evaluator functions that score each output, and call evaluate() with a target function. Initialize the Client, create a dataset, then pass your target and evaluators to evaluate() to get per-example scores and an experiment in the LangSmith UI.

PREREQUISITES

  • Python 3.8+
  • pip install langsmith
  • LANGSMITH_API_KEY environment variable set

Setup

Install the langsmith Python package and set your API key as an environment variable for authentication.

bash
pip install langsmith
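The API key can be exported in your shell before running any scripts; the value below is a placeholder for your own key:

```shell
# Make the LangSmith API key available to the SDK for this shell session
export LANGSMITH_API_KEY="your-api-key"
```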

Step by step

This example initializes the LangSmith client, uploads a small dataset of question/answer examples, defines a target function and an exact-match evaluator, and runs them together as an experiment. It follows the evaluate() entry point used by recent versions of the SDK.

python
import os
from langsmith import Client
from langsmith.evaluation import evaluate

# The Client reads LANGSMITH_API_KEY from the environment by default
client = Client()

# Create a dataset of question/answer examples to evaluate against
dataset = client.create_dataset(dataset_name="simple-accuracy-demo")
client.create_examples(
    inputs=[
        {"question": "What is 2+2?"},
        {"question": "Capital of France?"},
        {"question": "Color of sky?"},
    ],
    outputs=[
        {"answer": "4"},
        {"answer": "Paris"},
        {"answer": "blue"},
    ],
    dataset_id=dataset.id,
)

# Target function: the system under test. A real application would call a
# model here; this stub returns canned predictions to keep the demo offline.
def target(inputs: dict) -> dict:
    canned = {
        "What is 2+2?": "4",
        "Capital of France?": "Paris",
        "Color of sky?": "Blue",
    }
    return {"answer": canned[inputs["question"]]}

# Evaluator: exact-match accuracy, scored once per example
def accuracy(run, example) -> dict:
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    return {"key": "accuracy", "score": int(predicted == expected)}

# Run the evaluation as an experiment against the dataset
results = evaluate(
    target,
    data="simple-accuracy-demo",
    evaluators=[accuracy],
    experiment_prefix="simple-accuracy",
)
When the run completes, the SDK prints a link to the experiment in the LangSmith UI, where you can inspect per-example scores and the aggregate accuracy. With the canned predictions above, two of the three examples match exactly; "Blue" vs "blue" fails the case-sensitive comparison, giving an overall accuracy of about 0.67.
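Exact string comparison is brittle, as the "Blue" vs "blue" miss shows. Below is a sketch of a more forgiving evaluator that normalizes case and whitespace before comparing; it uses the argument-name-based evaluator signature (outputs, reference_outputs), which is assumed to be supported by your SDK version:

```python
def normalized_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 if the answers match after lowercasing and trimming whitespace."""
    predicted = outputs["answer"].strip().lower()
    expected = reference_outputs["answer"].strip().lower()
    return {"key": "normalized_match", "score": int(predicted == expected)}
```

Pass it in the evaluators list just like the exact-match version; under this scoring, the "Blue"/"blue" example would score 1.0.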

Common variations

Evaluations can also run asynchronously: recent SDK versions expose aevaluate(), which accepts an async target function. Because evaluators are plain Python functions, you can compute any metric you can express in code (F1, BLEU via an external library, LLM-as-judge, and so on), and you can compare different models by running different targets against the same dataset.

python
import asyncio

from langsmith.evaluation import aevaluate

# An async target can await real model calls; this stub stays offline
async def target(inputs: dict) -> dict:
    canned = {
        "What is 2+2?": "4",
        "Capital of France?": "Paris",
        "Color of sky?": "Blue",
    }
    return {"answer": canned[inputs["question"]]}

def accuracy(run, example) -> dict:
    return {
        "key": "accuracy",
        "score": int(run.outputs["answer"] == example.outputs["answer"]),
    }

async def main():
    # Reuses the dataset created in the step-by-step example
    results = await aevaluate(
        target,
        data="simple-accuracy-demo",
        evaluators=[accuracy],
    )

asyncio.run(main())

As with the synchronous run, the SDK prints a link to the resulting experiment when it completes.

Troubleshooting

  • If you get authentication errors, verify your LANGSMITH_API_KEY is correctly set in your environment.
  • If evaluation metrics return unexpected results, ensure your dataset examples and target outputs use the keys your evaluators read (e.g. "answer" in the examples above).
  • For network issues, check your internet connection and LangSmith service status.
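For the first bullet, a quick way to confirm the key is actually visible to your Python process before suspecting anything else (the helper name is mine, not part of the SDK):

```python
import os

def langsmith_key_present() -> bool:
    """Return True if LANGSMITH_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("LANGSMITH_API_KEY", "").strip())
```

Run it in the same environment (shell, container, notebook kernel) where the evaluation fails; keys exported in one shell are not visible in another.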

Key Takeaways

  • Use the official langsmith Python SDK to run evaluations programmatically.
  • Store inputs and reference outputs as dataset examples, and write evaluators that score each run against its example.
  • aevaluate() supports async targets for scalable workflows.
  • Always set LANGSMITH_API_KEY in your environment to authenticate.
  • Keep dataset input/output keys consistent with what your target and evaluators expect to avoid scoring errors.
Verified 2026-04