What is hybrid search in vector databases
Hybrid search in vector databases combines vector similarity search with traditional keyword-based search to improve retrieval accuracy. It leverages both semantic embeddings and exact text matches for more relevant results.
How it works
Hybrid search integrates two search methods: vector similarity search and keyword-based search. Vector search uses embeddings to find semantically similar items, while keyword search matches exact terms. Combining them allows retrieval systems to capture both semantic meaning and precise keyword matches, improving overall relevance. Imagine searching a library by both book topics (vector) and exact titles or authors (keyword).
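
As a rough illustration of combining the two signals, the sketch below blends a keyword-overlap score with cosine similarity using a weight `alpha`. This is only one possible way to combine them, not a method prescribed by this article: the function names, the `alpha` parameter, and the two-dimensional "embeddings" are all hypothetical and hand-picked so the example runs without any external service.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_score(query, text):
    """Fraction of query terms that appear verbatim in the text.

    Naive whitespace tokenization; punctuation is not stripped.
    """
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms)


def hybrid_score(query, query_vec, text, doc_vec, alpha=0.5):
    """Weighted blend: alpha for the vector side, 1 - alpha for keywords."""
    semantic = cosine_similarity(query_vec, doc_vec)
    lexical = keyword_score(query, text)
    return alpha * semantic + (1 - alpha) * lexical


# Hypothetical 2-D "embeddings", chosen by hand for illustration only.
docs = [
    ("neural networks for images", [0.9, 0.1]),
    ("vector index tuning", [0.2, 0.8]),
]
query, query_vec = "vector search", [0.3, 0.7]

# Rank documents from highest to lowest blended score.
ranked = sorted(
    docs,
    key=lambda d: hybrid_score(query, query_vec, d[0], d[1]),
    reverse=True,
)
for text, vec in ranked:
    print(f"{hybrid_score(query, query_vec, text, vec):.3f}  {text}")
```

Setting `alpha` closer to 1 would favor semantic similarity, while a value closer to 0 would favor exact term overlap.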
Concrete example
Here is a Python example that uses OpenAI embeddings and a simple form of hybrid search: keyword filtering followed by ranking by cosine similarity.

    import os

    import numpy as np
    from openai import OpenAI

    # The API key is read from the OPENAI_API_KEY environment variable.
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Sample documents
    documents = [
        {"id": 1, "text": "Deep learning advances AI capabilities."},
        {"id": 2, "text": "Vector databases store embeddings."},
        {"id": 3, "text": "Keyword search finds exact matches."},
    ]

    # Query
    query_text = "AI and vector search"

    # Get embedding for query
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query_text,
    )
    query_embedding = np.array(response.data[0].embedding)

    # Compute embeddings for documents (one request per document)
    doc_embeddings = []
    for doc in documents:
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["text"],
        )
        doc_embeddings.append(np.array(resp.data[0].embedding))

    # Simple cosine similarity function
    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Hybrid search: keep only docs whose text contains the keyword
    # (case-insensitive substring match), then rank them by similarity.
    keyword = "vector"
    filtered = [
        (doc, emb)
        for doc, emb in zip(documents, doc_embeddings)
        if keyword in doc["text"].lower()
    ]
    scored = [(doc, cosine_sim(query_embedding, emb)) for doc, emb in filtered]
    results = sorted(scored, key=lambda x: x[1], reverse=True)

    for doc, score in results:
        print(f"Doc ID: {doc['id']}, Score: {score:.4f}, Text: {doc['text']}")

Example output (the exact score depends on the embedding model):

    Doc ID: 2, Score: 0.8723, Text: Vector databases store embeddings.

When to use it
Use hybrid search when you need both semantic understanding and exact keyword matching in retrieval tasks. It is ideal for applications like enterprise search, e-commerce product search, and knowledge bases, where users expect precise keyword hits plus conceptually related results. Skip hybrid search if your dataset is small or if either semantic or keyword search alone suffices.
Key terms
| Term | Definition |
|---|---|
| Vector search | Retrieval based on similarity of vector embeddings representing semantic meaning. |
| Keyword search | Retrieval based on exact matching of text tokens or terms. |
| Embedding | Numerical vector representing text or data semantics. |
| Cosine similarity | Metric that measures the cosine of the angle between two vectors; values closer to 1 mean the vectors point in more similar directions. |
| Hybrid search | Combining vector similarity and keyword matching for retrieval. |
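
The cosine similarity entry can be written out explicitly. For two embedding vectors $\mathbf{a}$ and $\mathbf{b}$ (symbols chosen here for illustration), it is the dot product divided by the product of the vector lengths, which is the same computation the `cosine_sim` function in the example performs:

$$
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

As a small worked case with made-up vectors, take $\mathbf{a} = (1, 0)$ and $\mathbf{b} = (1, 1)$:

$$
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 0^2} \, \sqrt{1^2 + 1^2}} = \frac{1}{\sqrt{2}} \approx 0.707
$$

This corresponds to an angle of 45 degrees between the two vectors.
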
Key takeaways
- Hybrid search improves retrieval by combining semantic and exact keyword matching.
- Use hybrid search in applications needing both concept understanding and precise term hits.
- One simple way to implement hybrid search is to filter by keyword, then rank the remaining documents by vector similarity.
- Vector embeddings capture meaning; keywords ensure exact matches.
- Hybrid search balances recall and precision in vector databases.