Dense retrieval in Haystack explained
Quick answer
Dense retrieval in Haystack uses vector embeddings to match queries with documents by meaning, enabling more accurate search than keyword matching. It typically involves embedding documents and queries with a model such as OpenAI's text embedding models, then indexing the vectors in a document store for fast similarity search.
Prerequisites
- Python 3.8+
- An OpenAI API key
- pip install haystack-ai openai
Setup
Install haystack-ai along with the openai client it uses for embeddings, then set your OpenAI API key as an environment variable.
- Install packages:
pip install haystack-ai openai
- Export your API key:
export OPENAI_API_KEY='your_api_key' (Linux/macOS), or set it in your environment variables on Windows.
Step by step
This example shows how to build a dense retrieval pipeline in Haystack using OpenAI embedders and the built-in InMemoryDocumentStore. It loads documents, embeds and indexes them, and runs a semantic search query.
import os

from haystack import Document, Pipeline
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Load documents (one Document per non-empty line of the text file)
with open("example_docs.txt") as f:
    docs = [Document(content=line.strip()) for line in f if line.strip()]

# Embed and index the documents; the embedders read OPENAI_API_KEY from the environment
document_store = InMemoryDocumentStore()
docs_with_embeddings = OpenAIDocumentEmbedder().run(docs)["documents"]
document_store.write_documents(docs_with_embeddings)

# Build the retrieval pipeline: embed the query, then retrieve by vector similarity
pipeline = Pipeline()
pipeline.add_component("text_embedder", OpenAITextEmbedder())
pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

# Query the pipeline
query = "What is dense retrieval?"
result = pipeline.run({"text_embedder": {"text": query}})

print("Top documents:")
for doc in result["retriever"]["documents"]:
    print(f"- {doc.content[:200]}...")

Output
Top documents:
- Dense retrieval is a technique that uses dense vector embeddings to represent documents and queries, enabling semantic search...
- Unlike sparse retrieval, dense retrieval captures semantic similarity by embedding text into continuous vector spaces...
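Under the hood, an embedding retriever scores every stored vector against the query vector (typically with cosine or dot-product similarity) and returns the top_k best matches. Here is a minimal pure-Python sketch of that ranking step; the vectors are made-up toy data, not real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k_by_similarity(query_vec, indexed_docs, top_k=2):
    """Rank (text, vector) pairs by cosine similarity to the query vector."""
    return sorted(indexed_docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)[:top_k]

# Toy 2-dimensional "embeddings"; real models emit hundreds of dimensions.
index = [
    ("Dense retrieval uses vector embeddings for semantic search.", [0.9, 0.2]),
    ("A recipe for tomato soup.", [0.1, 0.9]),
    ("Sparse retrieval matches exact keywords.", [0.7, 0.5]),
]

query_vec = [1.0, 0.1]  # pretend embedding of "What is dense retrieval?"
for text, _ in top_k_by_similarity(query_vec, index):
    print("-", text)
```

The retrieval-related sentences score highest because their toy vectors point in nearly the same direction as the query vector; the soup recipe points elsewhere and is dropped.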
Common variations
You can customize dense retrieval in Haystack by:
- Using different embedding models, such as sentence-transformers or OpenAI variants.
- Switching document stores to Chroma, Weaviate, or Pinecone for scalability.
- Implementing asynchronous queries or streaming results in advanced pipelines.
Troubleshooting
If you see empty search results, ensure your documents loaded correctly and were embedded without errors. Check that the OPENAI_API_KEY environment variable is set. For indexing issues, verify that the documents were actually written to the document store and that the same embedding model produced both the document and query embeddings.
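Several of these failures can be caught before the pipeline runs at all. A small preflight sketch (the file name example_docs.txt and the specific checks are illustrative assumptions, not part of Haystack):

```python
import os

def preflight(doc_path="example_docs.txt"):
    """Return a list of setup problems to fix before building the pipeline."""
    problems = []
    # The Haystack OpenAI embedders read this environment variable by default.
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # An empty or missing corpus produces empty search results downstream.
    if not os.path.exists(doc_path):
        problems.append(f"{doc_path} not found")
    elif os.path.getsize(doc_path) == 0:
        problems.append(f"{doc_path} is empty")
    return problems

for problem in preflight():
    print("Fix before running:", problem)
```

Running this first turns a confusing empty-results query into an explicit, actionable error message.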
Key takeaways
- Dense retrieval uses vector embeddings for semantic search, outperforming keyword matching.
- Haystack pairs OpenAI embedders with a document store (in-memory here, swappable for scalable stores) for efficient dense retrieval pipelines.
- You can swap embedding models and vector stores to fit your scalability and accuracy needs.