How to use ChromaDB with Haystack
Quick answer
Use the official chroma-haystack integration: store documents in a ChromaDocumentStore, embed them with Haystack's OpenAI embedders, and retrieve them with a ChromaEmbeddingRetriever inside a Haystack Pipeline for semantic search.
Prerequisites
- Python 3.8+
- OpenAI API key
- pip install haystack-ai chroma-haystack
Setup
Install the required packages and set your OpenAI API key as an environment variable. The chroma-haystack package pulls in chromadb as a dependency.
pip install haystack-ai chroma-haystack
Step by step
This example loads a few documents, embeds them with OpenAI's text-embedding-3-small model, stores them in ChromaDB through ChromaDocumentStore, and queries them with a Haystack pipeline.
from haystack import Document, Pipeline
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

# Set your OpenAI API key in the environment first:
# export OPENAI_API_KEY="your_api_key"

# Sample documents
documents = [
    Document(content="ChromaDB is a fast vector database for embeddings."),
    Document(content="Haystack is a framework for building search systems."),
    Document(content="OpenAI provides powerful embedding models."),
]

# Initialize the Chroma-backed document store
document_store = ChromaDocumentStore(collection_name="haystack_chroma_collection")

# Embed the documents with OpenAI and write them to Chroma
doc_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
embedded_docs = doc_embedder.run(documents)["documents"]
document_store.write_documents(embedded_docs)

# Build a Haystack pipeline: embed the query, then retrieve from Chroma
pipeline = Pipeline()
pipeline.add_component("query_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
pipeline.add_component("retriever", ChromaEmbeddingRetriever(document_store=document_store, top_k=2))
pipeline.connect("query_embedder.embedding", "retriever.query_embedding")

# Query the pipeline
query = "What is ChromaDB?"
result = pipeline.run({"query_embedder": {"text": query}})

print("Top documents:")
for doc in result["retriever"]["documents"]:
    print(f"- {doc.content}")
Output
Top documents:
- ChromaDB is a fast vector database for embeddings.
- OpenAI provides powerful embedding models.
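Under the hood, the retriever ranks stored documents by vector similarity between the query embedding and each document embedding. A minimal standard-library sketch of that ranking step, using made-up three-dimensional vectors in place of real OpenAI embeddings:

```python
import math

# Toy vectors standing in for real OpenAI embeddings (illustration only).
doc_vectors = {
    "ChromaDB is a fast vector database for embeddings.": [0.9, 0.1, 0.0],
    "Haystack is a framework for building search systems.": [0.1, 0.9, 0.1],
    "OpenAI provides powerful embedding models.": [0.6, 0.2, 0.3],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vector, k=2):
    # Rank documents by similarity to the query and keep the top k.
    ranked = sorted(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]), reverse=True)
    return ranked[:k]

query_vector = [0.8, 0.1, 0.1]  # pretend embedding of "What is ChromaDB?"
for doc in retrieve(query_vector):
    print("-", doc)
```

Real vector stores such as Chroma use approximate nearest-neighbor indexes rather than this brute-force scan, but the ranking principle is the same.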
Common variations
- Use ChromaQueryTextRetriever instead of ChromaEmbeddingRetriever to let Chroma's built-in embedding function handle embeddings, which removes the OpenAI dependency.
- Switch to asynchronous execution with async-compatible Haystack components if your application is async.
- Use a different embedding model such as text-embedding-3-large, or plug in a custom embedder.
Troubleshooting
- If you get ModuleNotFoundError for chromadb or haystack_integrations, make sure the chroma-haystack package is installed.
- If retrieval returns no results, verify that your OpenAI API key is set in os.environ["OPENAI_API_KEY"] and that the documents were embedded before being written to the store.
- For large document sets, persist the Chroma collection to disk to avoid re-indexing on every run.
Key Takeaways
- Use ChromaDocumentStore from the chroma-haystack integration as the vector store backend in Haystack pipelines.
- Generate embeddings with Haystack's OpenAI embedders and store them in Chroma for fast semantic search.
- Connect a Chroma retriever into Haystack's Pipeline for flexible query handling.
- Set the API key environment variable before running to avoid authentication errors.
- Persist Chroma collections to disk for scalability and faster startup in production.