How to build semantic search with ChromaDB
Quick answer
Use ChromaDB to build semantic search by embedding your documents with an embedding model such as OpenAI's, storing the vectors in a Chroma collection, and querying with similarity search. This enables fast, scalable semantic retrieval based on vector similarity.
Prerequisites
- Python 3.8+
- An OpenAI API key (free tier works)
- pip install chromadb openai
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
pip install chromadb openai

Output:
Collecting chromadb
Collecting openai
Installing collected packages: chromadb, openai
Successfully installed chromadb-0.4.0 openai-1.7.0
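To make the key available to the examples below, export it as an environment variable (the key value shown is a placeholder; substitute your real key):

```shell
# Set the key for the current shell session (placeholder value; replace with your real key)
export OPENAI_API_KEY="sk-your-key-here"

# Confirm it is set without printing the key itself
echo "OPENAI_API_KEY is ${OPENAI_API_KEY:+set}"
```

Add the export line to your shell profile (e.g. ~/.bashrc) to make it persistent across sessions.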
Step by step
This example shows how to embed documents using OpenAI embeddings, add them to a ChromaDB collection, and perform a semantic search query.
import os
from openai import OpenAI
import chromadb
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Initialize Chroma client (in-memory for demo; data is lost when the process exits)
chroma_client = chromadb.Client()
# Create or get a collection
collection = chroma_client.get_or_create_collection(name="documents")
# Sample documents
texts = [
    "ChromaDB is a vector database for semantic search.",
    "OpenAI provides powerful embedding models.",
    "Semantic search finds relevant documents by meaning."
]
# Generate embeddings for documents
embeddings = []
for text in texts:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    embeddings.append(response.data[0].embedding)
# Add documents and embeddings to ChromaDB
collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=["doc1", "doc2", "doc3"]
)
# Query with semantic similarity
query = "What is semantic search?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=2
)
print("Top results:")
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"- {doc} (distance: {dist:.4f})")

Output:
Top results:
- Semantic search finds relevant documents by meaning. (distance: 0.1234)
- ChromaDB is a vector database for semantic search. (distance: 0.2345)
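The distances returned by collection.query measure how far apart two vectors are, so lower means more similar. ChromaDB uses squared L2 distance by default; cosine distance is another common choice for text embeddings. A minimal sketch of cosine distance in plain Python:

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity:
    # 0.0 means same direction, 1.0 means orthogonal vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (identical direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

To have ChromaDB use cosine distance instead of the default, pass metadata={"hnsw:space": "cosine"} when creating the collection.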
Common variations
You can use async calls with OpenAI embeddings, persist ChromaDB collections to disk, or swap embedding models. For large datasets, batch embedding and incremental indexing improve performance.
import asyncio
import os
from openai import AsyncOpenAI

# Use the async client; the synchronous OpenAI client has no async methods
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_embedding(text):
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

async def main():
    texts = ["Async embedding example."]
    embeddings = await asyncio.gather(*(async_embedding(t) for t in texts))
    print(embeddings)

asyncio.run(main())

Output:
[[0.00123, 0.00456, ...]]
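Batch embedding, mentioned above, cuts per-request overhead: the OpenAI embeddings endpoint accepts a list of input strings, so one request can embed a whole batch. A sketch of the idea (embed_in_batches, chunked, and the batch size of 100 are illustrative helpers, not part of either library's API):

```python
from typing import Iterator, List

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    # Yield successive batches of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_in_batches(client, texts: List[str], batch_size: int = 100) -> List[List[float]]:
    # One API call per batch instead of one call per document.
    embeddings: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch  # the endpoint accepts a list of strings
        )
        embeddings.extend(item.embedding for item in response.data)
    return embeddings
```

The returned embeddings stay in input order, so they can be passed straight to collection.add alongside the original texts and IDs.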
Troubleshooting
- If you get authentication errors, verify that your OPENAI_API_KEY environment variable is set correctly.
- If ChromaDB fails to persist, check directory permissions or switch to in-memory mode for testing.
- Embedding model errors may require updating the model parameter to a currently supported embedding model.
Key takeaways
- Use OpenAI embeddings to vectorize text for semantic search with ChromaDB.
- ChromaDB collections store embeddings and support fast similarity queries.
- Persist collections to disk for production use; in-memory mode is good for testing.
- Async embedding calls improve throughput for large datasets.
- Always verify environment variables and model names to avoid runtime errors.