Haystack pipeline explained
Quick answer
A Haystack pipeline is a modular workflow that connects components such as retrievers, prompt builders, and generators to perform tasks like question answering. A typical question-answering pipeline uses an InMemoryBM25Retriever backed by an InMemoryDocumentStore to fetch relevant documents, and an OpenAIGenerator to produce an answer from them.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install haystack-ai openai
Setup
Install the haystack-ai package and set your OpenAI API key as an environment variable.
- Install Haystack and the OpenAI SDK:
pip install haystack-ai openai
Step by step
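This example creates an in-memory document store, adds documents, sets up a BM25 retriever, builds a prompt from the retrieved documents, and uses the OpenAI generator to answer a query. It uses the Haystack 2.x API that ships with the haystack-ai package (add_component and connect rather than the older add_node).
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils import Secret
# Set your OpenAI API key in an environment variable before running
# export OPENAI_API_KEY="your_api_key"
# Initialize document store
document_store = InMemoryDocumentStore()
# Write sample documents (Haystack 2.x expects Document objects, not dicts)
docs = [
    Document(content="Haystack is an open-source NLP framework for building search systems."),
    Document(content="It supports retrievers and generators for question answering."),
]
document_store.write_documents(docs)
# Initialize retriever
retriever = InMemoryBM25Retriever(document_store=document_store)
# The generator takes a prompt string, so turn the retrieved documents into one
template = """Answer the question using only the documents below.
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{ query }}
Answer:"""
prompt_builder = PromptBuilder(template=template)
# Initialize generator with OpenAI gpt-4o-mini; the key is read from the environment
generator = OpenAIGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-4o-mini")
# Build pipeline: register the components, then connect outputs to inputs
pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("generator", generator)
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")
# Run pipeline: the query goes to both the retriever and the prompt builder
query = "What is Haystack?"
result = pipeline.run({"retriever": {"query": query}, "prompt_builder": {"query": query}})
print("Answer:", result["generator"]["replies"][0])
Output (exact wording will vary between runs)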
Answer: Haystack is an open-source NLP framework for building search systems that supports retrievers and generators for question answering.
Common variations
- Use an embedding-based retriever such as InMemoryEmbeddingRetriever for semantic search.
- Replace OpenAIGenerator with other generators such as OpenAIChatGenerator or HuggingFaceLocalGenerator.
- Use external document stores such as ElasticsearchDocumentStore or QdrantDocumentStore (installed as separate integration packages) for scalability.
- Run pipelines asynchronously or stream results for real-time applications.
Troubleshooting
- If you get authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
- If no answers are returned, ensure documents are properly written to the document store.
- For slow responses, consider using smaller models or caching retriever results.
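The first two checks above can be done before the pipeline runs. This is a minimal sketch using only the standard library; preflight_problems is a hypothetical helper written for this article, not part of Haystack, and it assumes the caller supplies the document count (for example from the document store's count_documents() method).

```python
import os
from typing import List, Mapping, Optional


def preflight_problems(
    document_count: int,
    env: Optional[Mapping[str, str]] = None,
    key_name: str = "OPENAI_API_KEY",
) -> List[str]:
    """Return human-readable problems found before running the pipeline."""
    env = os.environ if env is None else env
    problems = []
    # A missing or blank key is a common cause of authentication errors.
    if not env.get(key_name, "").strip():
        problems.append(f"{key_name} is not set or is empty")
    # An empty store gives the retriever nothing to return.
    if document_count == 0:
        problems.append("document store is empty; write documents before querying")
    return problems


# Explicit values for illustration; in a real script, pass
# document_store.count_documents() and omit env to read os.environ.
print(preflight_problems(document_count=0, env={}))
print(preflight_problems(document_count=2, env={"OPENAI_API_KEY": "sk-example"}))
```

An empty list means neither problem was detected. It does not prove the key is valid, only that one is present.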
Key Takeaways
- Haystack pipelines connect retrievers and generators to build powerful QA systems.
- Use InMemoryDocumentStore and InMemoryBM25Retriever for simple setups.
- OpenAI models like gpt-4o-mini can be used as generators in Haystack.
- Switch components easily for semantic search or scalable document storage.
- Always set your API keys via environment variables to avoid authentication issues.