ValueError: space mismatch
chromadb.errors.InvalidQueryException or ValueError (HNSW space dimension/metric mismatch)
Stack trace
chromadb.errors.InvalidQueryException: Error: space type mismatch. Index was created with space 'cosine', but query attempted with space 'euclidean'
Or:
ValueError: Embedding dimension mismatch: expected 1536 dimensions, got 768. HNSW index built with 1536-dim embeddings cannot query with 768-dim embeddings.
Traceback (most recent call last):
  File "app.py", line 45, in query_documents
    results = collection.query(query_embeddings=query_vec, n_results=5)
  File "chromadb/api/client.py", line 234, in query
    return self._client._query(self._name, query_embeddings, n_results, where_filter)
  File "chromadb/db/impl.py", line 187, in _query
    raise InvalidQueryException(f'space type mismatch. Index was created with space {self.metadata["space"]}, but query attempted with space {space}')
Why it happens
ChromaDB's HNSW (Hierarchical Navigable Small World) index is a metric-specific data structure. When you create a collection, you choose a distance space: `cosine`, `l2` (squared Euclidean), or `ip` (inner product). In the Python client the space is set through the collection metadata key `hnsw:space`, it defaults to `l2` when omitted, and it cannot be changed after the collection is created. The entire index graph is built with that metric, so its neighbor links are only meaningful under it. The error appears when your code and the index disagree: the collection was built with one metric while the code that opens or queries it assumes another, or the query embeddings come from a different model than the stored ones. An embedding model change (e.g., switching from a 1536-dim OpenAI model to a 768-dim alternative) causes a dimension mismatch, which the index detects and rejects.
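To see why an index built for one metric cannot serve another, the sketch below ranks the same three vectors under each space. It is plain Python, not ChromaDB code; the vectors are made up for illustration, and the formulas (squared L2, one minus inner product, one minus cosine similarity) follow the distance definitions in ChromaDB's documentation.

```python
import math


def l2_distance(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    """One minus the inner product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


METRICS = {"l2": l2_distance, "ip": ip_distance, "cosine": cosine_distance}


def nearest(query, docs, space):
    """Return the id of the document closest to `query` under `space`."""
    distance = METRICS[space]
    return min(docs, key=lambda doc_id: distance(query, docs[doc_id]))


# Illustrative vectors only
DOCS = {
    "short": [1.0, 0.0],  # same direction as the query, smaller norm
    "near": [2.0, 0.5],   # closest point in Euclidean terms
    "long": [3.0, 3.0],   # largest inner product with the query
}
QUERY = [2.0, 0.0]

for space in ("l2", "ip", "cosine"):
    print(space, "->", nearest(QUERY, DOCS, space))
# l2 -> near
# ip -> long
# cosine -> short
```

Each metric picks a different nearest neighbor for the same query, which is why a graph built under one metric gives wrong answers if it is read under another.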
Detection
Before querying, log the collection's metadata to verify the space setting. Note that `collection.metadata` is an attribute (a dict, or None if no metadata was set), not a method, and the metric is stored under the `hnsw:space` key. Print the dimension of your query vectors before passing them to collection.query(). Add assertions that the dimensions match those used when the index was built: `assert len(query_vec[0]) == expected_dim, f'Expected {expected_dim} dims, got {len(query_vec[0])}'`. Check `collection.metadata` at application startup as well, so a wrong or missing space setting is caught before it silently skews results.
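The checks above can be bundled into one helper that runs before every query. This is a minimal sketch: `check_before_query` and the logger name are hypothetical, and it assumes the metric was set through the `hnsw:space` metadata key (a collection configured another way may not expose it there).

```python
import logging

logger = logging.getLogger("chroma_checks")  # hypothetical logger name


def check_before_query(collection, query_embeddings, expected_dim):
    """Log the collection's stored metric and verify query dimensions.

    `collection` only needs `.name` and `.metadata` attributes, as a
    chromadb Collection has. Returns the stored metric.
    """
    metadata = collection.metadata or {}      # metadata is None when never set
    space = metadata.get("hnsw:space", "l2")  # 'l2' applies when the key is absent
    logger.info("collection=%s space=%s", collection.name, space)

    for i, vec in enumerate(query_embeddings):
        if len(vec) != expected_dim:
            raise ValueError(
                f"Query vector {i}: expected {expected_dim} dims, got {len(vec)}"
            )
    return space
```

A call such as `check_before_query(collection, [query_vec], expected_dim=1536)` placed directly before `collection.query(...)` turns a mismatch into an error raised in your own code, with the vector index in the message.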
Causes & fixes
Collection created with the 'cosine' space but the calling code assumes 'euclidean' (`l2` in ChromaDB's naming), or vice versa
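Set the metric in exactly one place: at collection creation, via `metadata={'hnsw:space': 'cosine'}`. collection.add() and collection.query() take no space argument and always use the metric stored with the collection. Cosine is the usual choice for text embeddings, but it is not the default (the default is `l2`), so set it explicitly. Store the space choice in an environment variable or config file so that every code path that creates or validates the collection reads the same value.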
Embedding model changed (e.g., from OpenAI text-embedding-3-small, 1536-dim, to a 768-dim model), but the index was built with the old dimensions
Rebuild the collection for the new model: either create a collection under a new name, or call delete_collection() and then create_collection() with the old name, and re-embed every document with the new model. Never add 768-dim embeddings to an index built with 1536-dim data. Validate the embedding dimension before any add or query: `assert all(len(e) == 1536 for e in embeddings)`
Using different embedding functions at different code stages (e.g., creating with ChromaDB's default embedding, querying with custom embedding function)
Always pass the same embedding_function to both client.create_collection() and client.get_collection(). If using custom embeddings, instantiate the function once and reuse it (OpenAIEmbeddingFunction lives in chromadb.utils.embedding_functions): `embedder = OpenAIEmbeddingFunction(api_key=os.environ['OPENAI_API_KEY'], model_name='text-embedding-3-small'); client.create_collection(name='documents', embedding_function=embedder, metadata={'hnsw:space': 'cosine'})`
Metadata corruption or old ChromaDB version stored space='ip' (inner product) but collection actually uses cosine distance
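Inspect collection.metadata (an attribute, not a method). A missing `hnsw:space` key means the collection uses the default, `l2`. If the value is not the one you intended, back up your data, drop the collection, and recreate it. delete_collection() returns None, so do not assign its result: `client.delete_collection(name='my_collection'); collection = client.create_collection(name='my_collection', metadata={'hnsw:space': 'cosine'}, embedding_function=embedder)`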
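The drop-and-recreate fix that several of the cases above end in can be wrapped in one function so the metric and the embedding function are always passed together. A minimal sketch: `recreate_collection` is a hypothetical helper, and `client` is assumed to behave like a chromadb client (`delete_collection`, `create_collection`).

```python
VALID_SPACES = ("l2", "ip", "cosine")


def recreate_collection(client, name, space="cosine", embedding_function=None):
    """Drop collection `name` and create it again with an explicit metric.

    All stored vectors are lost, so back up anything you need first.
    """
    if space not in VALID_SPACES:
        raise ValueError(f"Unsupported space {space!r}; expected one of {VALID_SPACES}")

    # Raises if the collection does not exist; returns None on success
    client.delete_collection(name=name)

    kwargs = {"name": name, "metadata": {"hnsw:space": space}}
    if embedding_function is not None:
        # Reuse the same function object everywhere the collection is opened
        kwargs["embedding_function"] = embedding_function
    return client.create_collection(**kwargs)
```

Validating `space` before the delete means a typo such as `'euclidean'` fails without touching the existing collection.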
Code: broken vs fixed
# Broken: tries to change the metric at query time
import chromadb
import os
from openai import OpenAI

# The metric is fixed when the collection (and its HNSW index) is created
client = chromadb.Client()
collection = client.create_collection(
    name='documents',
    metadata={'hnsw:space': 'cosine'},  # ← Index built with cosine metric
)

# Embed and add one document
openai_client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
embeddings = openai_client.embeddings.create(
    input=['hello world'],
    model='text-embedding-3-small',
)
collection.add(
    ids=['doc1'],
    embeddings=[embeddings.data[0].embedding],
    documents=['hello world'],
)

# Embed the query text
query_vec = openai_client.embeddings.create(
    input=['hello'],
    model='text-embedding-3-small',
)

# ❌ BUG: the index was built with 'cosine'. query() has no space argument,
# so asking for 'euclidean' here fails instead of switching the metric.
results = collection.query(
    query_embeddings=[query_vec.data[0].embedding],
    n_results=3,
    space='euclidean',  # ← MISMATCH: raises an error
)
print(results)

# Fixed: set the metric once at creation and validate dimensions
import chromadb
import os
from openai import OpenAI

# The metric is set once, at creation, through the collection metadata
client = chromadb.Client()
collection = client.create_collection(
    name='documents',
    metadata={'hnsw:space': 'cosine'},  # ← Define metric once
)

# Embed one document
openai_client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
embeddings = openai_client.embeddings.create(
    input=['hello world'],
    model='text-embedding-3-small',
)

# Verify embedding dimension (text-embedding-3-small returns 1536 by default)
expected_dim = 1536
assert len(embeddings.data[0].embedding) == expected_dim, f'Expected {expected_dim} dims'
collection.add(
    ids=['doc1'],
    embeddings=[embeddings.data[0].embedding],
    documents=['hello world'],
)

# Embed the query with the SAME model
query_vec = openai_client.embeddings.create(
    input=['hello'],
    model='text-embedding-3-small',
)

# Verify the stored metric and the query dimension
print(f"Collection space: {collection.metadata['hnsw:space']}")
assert len(query_vec.data[0].embedding) == expected_dim, 'Query dim mismatch'

# ✅ FIXED: no space argument; query() uses the metric stored with the collection
results = collection.query(
    query_embeddings=[query_vec.data[0].embedding],
    n_results=3,
)
print(f"Found {len(results['ids'][0])} results")
for doc_id, doc_text in zip(results['ids'][0], results['documents'][0]):
    print(f"{doc_id}: {doc_text}")
Workaround
If you cannot re-embed your corpus right away, you can still rebuild the index under the correct metric from the vectors already stored: read the ids, embeddings, documents, and metadata from the old collection with collection.get() (request the embeddings explicitly via `include`), delete the collection, create a new one with the cosine space, and re-add all data. Alternatively, keep two collections, one per metric (documents_cosine and documents_euclidean), and route each query to the collection that matches its metric and embedding model. Treat the two-collection setup as temporary: plan a full migration to a single, consistent space metric within 1-2 sprints.
Prevention
Store your space metric choice in an environment variable (CHROMA_SPACE=cosine) or a config file rather than hardcoding it in several places. Create a wrapper function that builds collections from validated parameters: `def create_doc_collection(client, space=os.environ.get('CHROMA_SPACE', 'cosine')):`. Pin a single embedding model across your pipeline; switching models requires a full collection rebuild, not just a code change. Add pre-query validation: check the embedding dimension and the collection's stored space before calling query(). Back up the collection's contents (for example, by saving the output of collection.get() with embeddings included) before any schema change.
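One way to implement the configuration-driven wrapper is sketched below. The `resolve_space` helper and the `env` parameter are hypothetical additions for testability, and `client` is assumed to behave like a chromadb client.

```python
import os

VALID_SPACES = ("l2", "ip", "cosine")


def resolve_space(env=None):
    """Read the metric from CHROMA_SPACE, defaulting to 'cosine'."""
    env = os.environ if env is None else env
    space = env.get("CHROMA_SPACE", "cosine").strip().lower()
    if space not in VALID_SPACES:
        raise ValueError(
            f"CHROMA_SPACE={space!r} is not supported; use one of {VALID_SPACES}"
        )
    return space


def create_doc_collection(client, name, env=None):
    """Create a collection whose metric comes from configuration, not literals."""
    return client.create_collection(
        name=name,
        metadata={"hnsw:space": resolve_space(env)},
    )
```

The environment is read when the function is called rather than in a default argument, which Python evaluates once at import time. A change to CHROMA_SPACE made after the module loads, as often happens in tests, is therefore still picked up.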