How to evaluate DSPy program quality
Quick answer
To evaluate DSPy program quality, use structured testing: define clear input-output signatures and validate outputs against expected results. Add debugging and logging inside DSPy modules to trace execution and confirm correctness.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install dspy openai>=1.0
Setup
Install the dspy and openai packages and set your OpenAI API key as an environment variable.
pip install dspy openai>=1.0

Step by step
Define a dspy.Signature with input and output fields, create a dspy.Predict instance, and run predictions. Validate the output by comparing it to expected values or using assertions.
import os
import dspy

# Configure the LM with OpenAI GPT-4o-mini
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)

# Define a signature for a simple QA task
# (the docstring doubles as the task instruction in DSPy)
class QA(dspy.Signature):
    """Answer the question concisely."""

    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
# Create a prediction module
qa = dspy.Predict(QA)
# Run a test input
result = qa(question="What is DSPy?")
print("Answer:", result.answer)
# Evaluate quality by asserting expected content
# (keywords are lowercased to match the lowercased answer)
expected_keywords = ["declarative", "programming", "ai"]
assert all(keyword in result.answer.lower() for keyword in expected_keywords), "Output missing expected keywords"
print("Evaluation passed: output contains all expected keywords.")

Output
Answer: DSPy is a declarative programming framework for AI that simplifies building and evaluating language model programs.
Evaluation passed: output contains all expected keywords.
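Beyond one-off assertions, the usual next step is batch evaluation over a small dev set. Below is a minimal sketch: the `keyword_metric` function and the stand-in objects are illustrative, and the commented-out `dspy.Evaluate` usage assumes a configured LM and the `qa` module from above.

```python
from types import SimpleNamespace

# Hypothetical metric: DSPy metrics take (example, prediction, trace=None)
# and return a score; here, whether all expected keywords appear.
def keyword_metric(example, pred, trace=None):
    answer = pred.answer.lower()
    return all(kw.lower() in answer for kw in example.keywords)

# Quick sanity check with stand-in objects (no LM call needed)
pred = SimpleNamespace(answer="DSPy is a declarative programming framework for AI.")
example = SimpleNamespace(keywords=["declarative", "programming", "AI"])
print(keyword_metric(example, pred))  # True

# With a real dev set, dspy.Evaluate runs the program on each example
# and averages the metric (requires a configured LM):
#   devset = [dspy.Example(question="What is DSPy?",
#                          keywords=["declarative"]).with_inputs("question")]
#   evaluator = dspy.Evaluate(devset=devset, metric=keyword_metric)
#   evaluator(qa)
```

Writing the metric as a plain function keeps it unit-testable without any API calls, so the scoring logic itself can be verified before spending tokens.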
Common variations
You can evaluate DSPy programs asynchronously using asyncio, or test different models by changing the model name passed to dspy.LM. For streaming outputs, capture partial results as they arrive (e.g. via DSPy's streaming support or the OpenAI streaming API).
import asyncio

# dspy.asyncify wraps a module so it can be awaited
async_qa = dspy.asyncify(qa)

async def async_test():
    result = await async_qa(question="Explain RAG in AI.")
    print("Async answer:", result.answer)

asyncio.run(async_test())

Output
Async answer: RAG stands for Retrieval-Augmented Generation, a technique combining retrieval with language generation to improve accuracy.
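When testing across models, the same keyword checks can be packaged into a small reusable harness. This is a sketch; `run_suite` and `stub_predict` are hypothetical names, and the stub stands in for a real call like `lambda q: qa(question=q).answer`.

```python
# Hypothetical harness: run (question, expected keywords) cases against
# any predict function and return (passed, total) counts.
def run_suite(predict_fn, cases):
    passed = 0
    for question, keywords in cases:
        answer = predict_fn(question).lower()
        if all(kw.lower() in answer for kw in keywords):
            passed += 1
    return passed, len(cases)

# Stand-in predictor for illustration (no API call); with a real module,
# pass lambda q: qa(question=q).answer instead.
def stub_predict(question):
    return "DSPy is a declarative programming framework for AI."

cases = [
    ("What is DSPy?", ["declarative", "programming"]),
    ("Is DSPy for AI?", ["AI"]),
]
print(run_suite(stub_predict, cases))  # (2, 2)
```

Because the harness only depends on a callable, the same suite can be rerun after swapping the dspy.LM model name to compare quality across models.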
Troubleshooting
- If outputs are inconsistent, increase max_tokens or adjust temperature in the LM configuration.
- Use logging inside DSPy modules to trace intermediate values.
- Ensure your API key is correctly set in os.environ["OPENAI_API_KEY"] to avoid authentication errors.
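For the authentication point above, failing fast with a clear message is easier to debug than a cryptic mid-run API error. A minimal sketch; the helper name is illustrative.

```python
import os

def require_api_key(var="OPENAI_API_KEY"):
    """Return the API key, or raise a clear error before any LM call."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before configuring dspy.LM.")
    return key

# Usage (hypothetical): fail at startup rather than on the first request
# lm = dspy.LM("openai/gpt-4o-mini", api_key=require_api_key())
```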
Key Takeaways
- Use explicit input-output signatures in DSPy to enable structured evaluation.
- Validate outputs with assertions or keyword checks to ensure program correctness.
- Leverage async and streaming variants for flexible testing scenarios.
- Enable logging and adjust LM parameters to troubleshoot quality issues.