How to evaluate pipelines in Haystack
Quick answer
In Haystack 2.x, evaluation is done with evaluator components such as AnswerExactMatchEvaluator, SASEvaluator, or DocumentMRREvaluator. Run your pipeline on labeled questions, then pass the ground-truth and predicted answers to an evaluator's run() method to get scores. (The pipeline.eval() method belongs to Haystack 1.x and does not exist in haystack-ai.)

PREREQUISITES
- Python 3.8+
- pip install "haystack-ai>=2.1" (the built-in evaluators shipped with the 2.1 release)
- OpenAI API key (free tier works) if using OpenAIGenerator

Setup
Install Haystack v2 and set your OpenAI API key as an environment variable for generator components.
pip install "haystack-ai>=2.1" openai

Step by step
Build a retrieval-augmented QA pipeline, run it on labeled questions, and score the predicted answers with an evaluator component. This example evaluates a QA pipeline built from OpenAIGenerator and InMemoryBM25Retriever.
import os

from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.evaluators import AnswerExactMatchEvaluator
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# OpenAIGenerator reads OPENAI_API_KEY from the environment by default
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running"

# Initialize document store and add documents
document_store = InMemoryDocumentStore()
document_store.write_documents([Document(id="1", content="Paris is the capital of France.")])

# Initialize retriever, prompt builder, and generator
retriever = InMemoryBM25Retriever(document_store=document_store)
prompt_builder = PromptBuilder(template="""Answer the question using the context.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:""")
generator = OpenAIGenerator(model="gpt-4o-mini")

# Build pipeline
pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("generator", generator)
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

# Prepare evaluation data
questions = ["What is the capital of France?"]
ground_truth_answers = ["Paris"]

# Run the pipeline to collect predictions
predicted_answers = []
for question in questions:
    result = pipeline.run({
        "retriever": {"query": question, "top_k": 1},
        "prompt_builder": {"question": question},
    })
    predicted_answers.append(result["generator"]["replies"][0])

# Evaluate predictions against the labels
evaluator = AnswerExactMatchEvaluator()
results = evaluator.run(ground_truth_answers=ground_truth_answers,
                        predicted_answers=predicted_answers)
print("Evaluation results:", results)

Output
Evaluation results: {'individual_scores': [1], 'score': 1.0}
Note that exact match requires the model's reply to equal the label string exactly; for fuzzy scoring, use SASEvaluator (semantic answer similarity) instead.

Common variations
- Use DocumentMAPEvaluator or DocumentMRREvaluator for retrieval-only evaluation with metrics like MAP and MRR.
- Evaluate generative pipelines with different models such as gpt-4o, or swap in a generator from an integration package (e.g. Anthropic's claude-3-5-sonnet-20241022 via the anthropic-haystack integration).
- Speed up large evaluations by batching pipeline runs and scoring all predictions in a single evaluator call.
Troubleshooting
- If evaluation returns zero scores, verify your labeled data format matches Haystack's expected schema.
- Ensure your retriever returns relevant documents by testing retrieval separately before full pipeline evaluation.
- Check API key environment variables if generator calls fail with authentication errors.
Key Takeaways
- Run your pipeline on labeled data and score its outputs with evaluator components to measure performance in Haystack 2.x.
- Choose evaluators based on pipeline type: document evaluators for retrieval-only pipelines, answer evaluators for generative ones.
- Validate your dataset format and API keys to avoid common evaluation errors.