How to evaluate DSPy program quality
Quick answer
To evaluate DSPy program quality, use structured testing: define clear input-output signatures and validate outputs against expected results. Add debugging and logging inside DSPy modules to trace execution and confirm correctness.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install dspy openai>=1.0
Setup
Install the dspy and openai packages and set your OpenAI API key as an environment variable.
pip install dspy openai>=1.0

Step by step
Define a dspy.Signature with input and output fields, create a dspy.Predict instance, and run predictions. Validate the output by comparing it to expected values or using assertions.
import os
import dspy

# Configure the LM with OpenAI GPT-4o-mini
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)

# Define a signature for a simple QA task
# (the docstring doubles as the task instruction in DSPy)
class QA(dspy.Signature):
    """Answer the question concisely."""

    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
# Create a prediction module
qa = dspy.Predict(QA)
# Run a test input
result = qa(question="What is DSPy?")
print("Answer:", result.answer)
# Evaluate quality by asserting expected content
# (keywords are lowercased to match the lowercased answer)
expected_keywords = ["declarative", "programming", "ai"]
assert all(keyword in result.answer.lower() for keyword in expected_keywords), "Output missing expected keywords"
print("Evaluation passed: output contains all expected keywords.")

Output
Answer: DSPy is a declarative programming framework for AI that simplifies building and evaluating language model programs.
Evaluation passed: output contains all expected keywords.
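Beyond one-off assertions, the usual next step is batch evaluation over a small dev set. Below is a minimal sketch: the `keyword_metric` function and the stand-in objects are illustrative, and the commented-out `dspy.Evaluate` usage assumes a configured LM and the `qa` module from above.

```python
from types import SimpleNamespace

# Hypothetical metric: DSPy metrics take (example, prediction, trace=None)
# and return a score; here, whether all expected keywords appear.
def keyword_metric(example, pred, trace=None):
    answer = pred.answer.lower()
    return all(kw.lower() in answer for kw in example.keywords)

# Quick sanity check with stand-in objects (no LM call needed)
pred = SimpleNamespace(answer="DSPy is a declarative programming framework for AI.")
example = SimpleNamespace(keywords=["declarative", "programming", "AI"])
print(keyword_metric(example, pred))  # True

# With a real dev set, dspy.Evaluate runs the program on each example
# and averages the metric (requires a configured LM):
#   devset = [dspy.Example(question="What is DSPy?",
#                          keywords=["declarative"]).with_inputs("question")]
#   evaluator = dspy.Evaluate(devset=devset, metric=keyword_metric)
#   evaluator(qa)
```

Writing the metric as a plain function keeps it unit-testable without any API calls, so the scoring logic itself can be verified before spending tokens.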
Common variations
You can evaluate DSPy programs asynchronously using asyncio, or test different models by changing the model name passed to dspy.LM. For streaming outputs, capture partial results as they arrive (e.g. via DSPy's streaming support or the OpenAI streaming API).
import asyncio

# dspy.asyncify wraps a module so it can be awaited
async_qa = dspy.asyncify(qa)

async def async_test():
    result = await async_qa(question="Explain RAG in AI.")
    print("Async answer:", result.answer)

asyncio.run(async_test())

Output
Async answer: RAG stands for Retrieval-Augmented Generation, a technique combining retrieval with language generation to improve accuracy.
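When testing across models, the same keyword checks can be packaged into a small reusable harness. This is a sketch; `run_suite` and `stub_predict` are hypothetical names, and the stub stands in for a real call like `lambda q: qa(question=q).answer`.

```python
# Hypothetical harness: run (question, expected keywords) cases against
# any predict function and return (passed, total) counts.
def run_suite(predict_fn, cases):
    passed = 0
    for question, keywords in cases:
        answer = predict_fn(question).lower()
        if all(kw.lower() in answer for kw in keywords):
            passed += 1
    return passed, len(cases)

# Stand-in predictor for illustration (no API call); with a real module,
# pass lambda q: qa(question=q).answer instead.
def stub_predict(question):
    return "DSPy is a declarative programming framework for AI."

cases = [
    ("What is DSPy?", ["declarative", "programming"]),
    ("Is DSPy for AI?", ["AI"]),
]
print(run_suite(stub_predict, cases))  # (2, 2)
```

Because the harness only depends on a callable, the same suite can be rerun after swapping the dspy.LM model name to compare quality across models.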
Troubleshooting
- If outputs are inconsistent, increase max_tokens or adjust temperature in the LM configuration.
- Use logging inside DSPy modules to trace intermediate values.
- Ensure your API key is correctly set in os.environ["OPENAI_API_KEY"] to avoid authentication errors.
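For the authentication point above, failing fast with a clear message is easier to debug than a cryptic mid-run API error. A minimal sketch; the helper name is illustrative.

```python
import os

def require_api_key(var="OPENAI_API_KEY"):
    """Return the API key, or raise a clear error before any LM call."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before configuring dspy.LM.")
    return key

# Usage (hypothetical): fail at startup rather than on the first request
# lm = dspy.LM("openai/gpt-4o-mini", api_key=require_api_key())
```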
Key Takeaways
- Use explicit input-output signatures in DSPy to enable structured evaluation.
- Validate outputs with assertions or keyword checks to ensure program correctness.
- Leverage async and streaming variants for flexible testing scenarios.
- Enable logging and adjust LM parameters to troubleshoot quality issues.