Fix poor semantic search results
Quick answer
Fix poor semantic search results by using high-quality OpenAI embeddings such as text-embedding-3-small, preprocessing documents and queries identically, and storing vectors in a database such as FAISS or Chroma for efficient similarity search. Embed the query with the same model and use an appropriate similarity metric (typically cosine similarity) to improve relevance.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install faiss-cpu or chromadb
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
- Install OpenAI SDK and FAISS for vector search:
pip install openai faiss-cpu

output

Collecting openai
Collecting faiss-cpu
Successfully installed openai-1.x faiss-cpu-1.x
Step by step
This example shows how to embed documents and queries using text-embedding-3-small, store embeddings in FAISS, and perform semantic search with cosine similarity.
import os
from openai import OpenAI
import faiss
import numpy as np
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Sample documents
documents = [
    "The Eiffel Tower is located in Paris.",
    "Python is a popular programming language.",
    "OpenAI develops advanced AI models.",
    "The Great Wall of China is visible from space."
]
# Function to get embeddings
def get_embeddings(texts):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [data.embedding for data in response.data]
# Embed documents
doc_embeddings = get_embeddings(documents)
# Convert to numpy array
embedding_dim = len(doc_embeddings[0])
embeddings_np = np.array(doc_embeddings).astype('float32')
# Build FAISS index
index = faiss.IndexFlatIP(embedding_dim) # Inner product for cosine similarity
# Normalize embeddings for cosine similarity
faiss.normalize_L2(embeddings_np)
index.add(embeddings_np)
# Query
query = "Where is the Eiffel Tower?"
query_embedding = get_embeddings([query])[0]
query_np = np.array([query_embedding]).astype('float32')
faiss.normalize_L2(query_np)
# Search top 2
k = 2
D, I = index.search(query_np, k)
print("Query:", query)
print("Top results:")
for i, idx in enumerate(I[0]):
    print(f"{i+1}. {documents[idx]} (score: {D[0][i]:.4f})")

output
Query: Where is the Eiffel Tower?
Top results:
1. The Eiffel Tower is located in Paris. (score: 0.9876)
2. The Great Wall of China is visible from space. (score: 0.4321)
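The index above uses IndexFlatIP with L2-normalized vectors because the inner product of unit vectors equals their cosine similarity. A quick NumPy check on two toy vectors (hypothetical values standing in for real embeddings) illustrates the equivalence:

```python
import numpy as np

# Toy 2-D vectors standing in for embeddings (hypothetical values)
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity computed directly
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same value via L2 normalization + inner product, which is what
# faiss.normalize_L2 followed by IndexFlatIP computes
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(np.dot(a_unit, b_unit))

print(cosine, inner)  # both ≈ 0.96
```

This is why both document and query vectors must be normalized: skipping normalization on either side silently changes the metric.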
Common variations
For larger datasets you can use other vector stores such as Chroma, or the GPU build of FAISS (faiss-gpu). For async workflows, use the AsyncOpenAI client. You can also experiment with other embedding models such as text-embedding-3-large, which trades higher cost for better retrieval quality.
import os
import chromadb
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Initialize Chroma client
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="docs")
docs = ["AI is transforming industries.", "Mount Everest is the tallest mountain."]
response = client.embeddings.create(model="text-embedding-3-small", input=docs)

# Add to Chroma
collection.add(
    documents=docs,
    embeddings=[e.embedding for e in response.data],
    ids=["1", "2"]
)
# Query
query = "What is the highest mountain?"
query_embedding = client.embeddings.create(model="text-embedding-3-small", input=[query]).data[0].embedding
results = collection.query(query_embeddings=[query_embedding], n_results=1)
print(results)

output
{'ids': [['2']], 'documents': [['Mount Everest is the tallest mountain.']], 'distances': [[0.1234]]}

Troubleshooting
- Poor relevance: Ensure consistent text preprocessing (lowercase, remove punctuation) before embedding.
- Low similarity scores: Normalize embeddings before indexing and querying.
- Slow search: Use approximate nearest neighbor indexes like faiss.IndexIVFFlat for large datasets.
- API errors: Check your OPENAI_API_KEY environment variable and network connectivity.
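On the preprocessing point: the key requirement is that documents and queries pass through the same normalization. A minimal sketch (the helper name and rules are illustrative, not part of any API):

```python
import re

def preprocess(text: str) -> str:
    """Normalize text the same way for documents and queries:
    lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)        # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

print(preprocess("Where is  the Eiffel Tower?"))  # where is the eiffel tower
```

Apply it once before embedding documents and again before embedding each query; mismatched preprocessing between the two sides is a common cause of poor relevance.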
Key takeaways
- Use high-quality OpenAI embeddings like text-embedding-3-small for semantic search.
- Normalize embeddings and use vector databases such as FAISS or Chroma for efficient similarity search.
- Preprocess text consistently to improve embedding quality and search relevance.
- Embed queries with the same model and preprocessing as your documents, and experiment with different models and vector stores.
- For large datasets, use approximate nearest neighbor indexes to maintain performance.