How-to · Intermediate · 4 min read

How to deploy a semantic search API

Quick answer
Deploy a semantic search API by generating vector embeddings with a model like text-embedding-3-small, storing them in a vector store such as FAISS or Chroma, and answering queries with similarity search. Use the OpenAI Python SDK to create embeddings and a lightweight web framework like FastAPI to serve the API.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" faiss-cpu fastapi uvicorn (quote the version pin so the shell does not treat >= as a redirection)

Setup

Install required Python packages and set your OpenAI API key as an environment variable.

  • Install packages: openai for embeddings, faiss-cpu for vector search, and fastapi with uvicorn for the API server.
  • Set environment variable: export OPENAI_API_KEY='your_api_key' on Linux/macOS, or setx OPENAI_API_KEY your_api_key on Windows.
bash
pip install openai faiss-cpu fastapi uvicorn
output
Collecting openai
Collecting faiss-cpu
Collecting fastapi
Collecting uvicorn
Successfully installed openai faiss-cpu fastapi uvicorn
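Before starting the server, it helps to fail fast if the key is missing rather than hit an authentication error on the first request. A minimal sketch (the helper name is illustrative):

```python
import os

def require_api_key() -> str:
    """Return the OpenAI API key, raising a clear error if it is unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before starting the server.")
    return key
```

Call this once at startup so a misconfigured environment is reported immediately.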

Step by step

This example shows how to create embeddings for documents, store them in FAISS, and deploy a FastAPI server to query semantic search results.

python
import os
from openai import OpenAI
import faiss
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample documents to index
documents = [
    "The Eiffel Tower is in Paris.",
    "Python is a popular programming language.",
    "OpenAI develops advanced AI models.",
    "FastAPI is great for building APIs."
]

# Generate embeddings for documents
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
)
embeddings = np.array([data.embedding for data in response.data], dtype=np.float32)

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Map index to documents
id_to_doc = {i: doc for i, doc in enumerate(documents)}

# FastAPI app
app = FastAPI()

class QueryRequest(BaseModel):
    query: str
    top_k: int = 3

@app.post("/search")
def search(request: QueryRequest):
    # Generate embedding for query
    query_resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[request.query]
    )
    query_embedding = np.array(query_resp.data[0].embedding, dtype=np.float32).reshape(1, -1)

    # Search FAISS index (cap top_k so FAISS never returns -1 placeholder ids)
    k = min(request.top_k, index.ntotal)
    distances, indices = index.search(query_embedding, k)

    results = []
    for dist, idx in zip(distances[0], indices[0]):
        results.append({"document": id_to_doc[int(idx)], "distance": float(dist)})

    return {"query": request.query, "results": results}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
output
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

# Example POST request to http://localhost:8000/search with JSON body:
# {"query": "Where is the Eiffel Tower?", "top_k": 2}

# Response:
# {
#   "query": "Where is the Eiffel Tower?",
#   "results": [
#     {"document": "The Eiffel Tower is in Paris.", "distance": 0.0023},
#     {"document": "OpenAI develops advanced AI models.", "distance": 1.2345}
#   ]
# }
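Once the server is running, any HTTP client can send the request shown above. A standard-library-only client sketch (the URL assumes the local server from this example):

```python
import json
import urllib.request

def build_payload(query: str, top_k: int = 3) -> dict:
    """Request body expected by the /search endpoint."""
    return {"query": query, "top_k": top_k}

def search(query: str, top_k: int = 3,
           url: str = "http://localhost:8000/search") -> dict:
    """POST a query to the running search API and return the parsed JSON response."""
    data = json.dumps(build_payload(query, top_k)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the server running:
# search("Where is the Eiffel Tower?", top_k=2)
```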

Common variations

You can adapt this semantic search API by:

  • Using Chroma or FAISS GPU for scalable vector storage.
  • Switching to async FastAPI endpoints for higher throughput.
  • Using different embedding models like text-embedding-3-large for better accuracy.
  • Adding metadata filtering or hybrid search combining keyword and vector search.
python
import os
import numpy as np
from openai import AsyncOpenAI

# Reuse app, index, id_to_doc, and QueryRequest from the step-by-step example.
# The synchronous client would block the event loop, so use AsyncOpenAI here.
async_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

@app.post("/search-async")
async def search_async(request: QueryRequest):
    # Non-blocking call to the OpenAI embeddings endpoint
    query_resp = await async_client.embeddings.create(
        model="text-embedding-3-small",
        input=[request.query]
    )
    query_embedding = np.array(query_resp.data[0].embedding, dtype=np.float32).reshape(1, -1)
    distances, indices = index.search(query_embedding, request.top_k)
    results = [
        {"document": id_to_doc[int(idx)], "distance": float(dist)}
        for dist, idx in zip(distances[0], indices[0])
    ]
    return {"query": request.query, "results": results}
output
INFO:     Started server process [12346]
INFO:     Uvicorn running on http://0.0.0.0:8000

# Async endpoint supports concurrent requests efficiently.
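IndexFlatL2 ranks by Euclidean distance, where smaller is better. Some setups prefer cosine similarity instead; in FAISS that means L2-normalizing the embeddings and using IndexFlatIP, since the inner product of unit vectors is the cosine. A NumPy-only sketch of that ranking (the toy vectors stand in for real embeddings):

```python
import numpy as np

def cosine_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Rank corpus rows by cosine similarity to the query (highest first)."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                  # inner product of unit vectors = cosine
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy 4-dimensional "embeddings": row 2 is nearly parallel to row 0
corpus = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0.9, 0.1, 0, 0]], dtype=np.float32)
ids, scores = cosine_top_k(np.array([1, 0, 0, 0], dtype=np.float32), corpus, k=2)
# ids → [0, 2]: the identical vector first, then the near-parallel one
```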

Troubleshooting

  • API key errors: Ensure OPENAI_API_KEY is set correctly in your environment.
  • Embedding dimension mismatch: Confirm the embedding model output dimension matches your vector index dimension.
  • Slow search: Use approximate nearest neighbor indexes like faiss.IndexIVFFlat for large datasets.
  • Server errors: Check FastAPI logs and ensure dependencies are installed.

Key Takeaways

  • Use OpenAI embeddings to convert text into vectors for semantic search.
  • Store embeddings in a vector database like FAISS for efficient similarity queries.
  • Deploy a lightweight API server with FastAPI to serve semantic search requests.
  • Async endpoints and scalable vector stores improve performance for production use.
  • Validate environment variables and embedding dimensions to avoid common errors.
Verified 2026-04 · text-embedding-3-small, text-embedding-3-large