How-to · Intermediate · 4 min read

How to deploy a semantic search API

Quick answer
Deploy a semantic search API by generating vector embeddings with a model like text-embedding-3-small, storing them in a vector store such as FAISS or Chroma, and answering queries with similarity search. Use the OpenAI Python SDK to create embeddings and a lightweight web framework like FastAPI to serve the API.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" faiss-cpu fastapi uvicorn (quote the version pin so the shell does not treat >= as a redirection)

Setup

Install required Python packages and set your OpenAI API key as an environment variable.

  • Install packages: openai for embeddings, faiss-cpu for vector search, and fastapi with uvicorn for the API server.
  • Set environment variable: export OPENAI_API_KEY='your_api_key' on Linux/macOS, or setx OPENAI_API_KEY your_api_key on Windows.
bash
pip install openai faiss-cpu fastapi uvicorn
output
Collecting openai
Collecting faiss-cpu
Collecting fastapi
Collecting uvicorn
Successfully installed openai faiss-cpu fastapi uvicorn
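Before starting the server, it helps to fail fast if the key is missing rather than hit an authentication error on the first request. A minimal sketch (the helper name is illustrative):

```python
import os

def require_api_key() -> str:
    """Return the OpenAI API key, raising a clear error if it is unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before starting the server.")
    return key
```

Call this once at startup so a misconfigured environment is reported immediately.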

Step by step

This example shows how to create embeddings for documents, store them in FAISS, and deploy a FastAPI server to query semantic search results.

python
import os
from openai import OpenAI
import faiss
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample documents to index
documents = [
    "The Eiffel Tower is in Paris.",
    "Python is a popular programming language.",
    "OpenAI develops advanced AI models.",
    "FastAPI is great for building APIs."
]

# Generate embeddings for documents
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
)
embeddings = np.array([data.embedding for data in response.data], dtype=np.float32)

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Map index to documents
id_to_doc = {i: doc for i, doc in enumerate(documents)}

# FastAPI app
app = FastAPI()

class QueryRequest(BaseModel):
    query: str
    top_k: int = 3

@app.post("/search")
def search(request: QueryRequest):
    # Generate embedding for query
    query_resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[request.query]
    )
    query_embedding = np.array(query_resp.data[0].embedding, dtype=np.float32).reshape(1, -1)

    # Search FAISS index (cap top_k so FAISS never returns -1 placeholder ids)
    k = min(request.top_k, index.ntotal)
    distances, indices = index.search(query_embedding, k)

    results = []
    for dist, idx in zip(distances[0], indices[0]):
        results.append({"document": id_to_doc[int(idx)], "distance": float(dist)})

    return {"query": request.query, "results": results}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
output
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

# Example POST request to http://localhost:8000/search with JSON body:
# {"query": "Where is the Eiffel Tower?", "top_k": 2}

# Response:
# {
#   "query": "Where is the Eiffel Tower?",
#   "results": [
#     {"document": "The Eiffel Tower is in Paris.", "distance": 0.0023},
#     {"document": "OpenAI develops advanced AI models.", "distance": 1.2345}
#   ]
# }
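Once the server is running, any HTTP client can send the request shown above. A standard-library-only client sketch (the URL assumes the local server from this example):

```python
import json
import urllib.request

def build_payload(query: str, top_k: int = 3) -> dict:
    """Request body expected by the /search endpoint."""
    return {"query": query, "top_k": top_k}

def search(query: str, top_k: int = 3,
           url: str = "http://localhost:8000/search") -> dict:
    """POST a query to the running search API and return the parsed JSON response."""
    data = json.dumps(build_payload(query, top_k)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the server running:
# search("Where is the Eiffel Tower?", top_k=2)
```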

Common variations

You can adapt this semantic search API by:

  • Using Chroma or FAISS GPU for scalable vector storage.
  • Switching to async FastAPI endpoints for higher throughput.
  • Using different embedding models like text-embedding-3-large for better accuracy.
  • Adding metadata filtering or hybrid search combining keyword and vector search.
python
import os
import numpy as np
from openai import AsyncOpenAI

# Reuse app, index, id_to_doc, and QueryRequest from the step-by-step example.
# The synchronous client would block the event loop, so use AsyncOpenAI here.
async_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

@app.post("/search-async")
async def search_async(request: QueryRequest):
    # Non-blocking call to the OpenAI embeddings endpoint
    query_resp = await async_client.embeddings.create(
        model="text-embedding-3-small",
        input=[request.query]
    )
    query_embedding = np.array(query_resp.data[0].embedding, dtype=np.float32).reshape(1, -1)
    distances, indices = index.search(query_embedding, request.top_k)
    results = [
        {"document": id_to_doc[int(idx)], "distance": float(dist)}
        for dist, idx in zip(distances[0], indices[0])
    ]
    return {"query": request.query, "results": results}
output
INFO:     Started server process [12346]
INFO:     Uvicorn running on http://0.0.0.0:8000

# Async endpoint supports concurrent requests efficiently.
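IndexFlatL2 ranks by Euclidean distance, where smaller is better. Some setups prefer cosine similarity instead; in FAISS that means L2-normalizing the embeddings and using IndexFlatIP, since the inner product of unit vectors is the cosine. A NumPy-only sketch of that ranking (the toy vectors stand in for real embeddings):

```python
import numpy as np

def cosine_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Rank corpus rows by cosine similarity to the query (highest first)."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                  # inner product of unit vectors = cosine
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy 4-dimensional "embeddings": row 2 is nearly parallel to row 0
corpus = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0.9, 0.1, 0, 0]], dtype=np.float32)
ids, scores = cosine_top_k(np.array([1, 0, 0, 0], dtype=np.float32), corpus, k=2)
# ids → [0, 2]: the identical vector first, then the near-parallel one
```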

Troubleshooting

  • API key errors: Ensure OPENAI_API_KEY is set correctly in your environment.
  • Embedding dimension mismatch: Confirm the embedding model output dimension matches your vector index dimension.
  • Slow search: Use approximate nearest neighbor indexes like faiss.IndexIVFFlat for large datasets.
  • Server errors: Check FastAPI logs and ensure dependencies are installed.

Key Takeaways

  • Use OpenAI embeddings to convert text into vectors for semantic search.
  • Store embeddings in a vector database like FAISS for efficient similarity queries.
  • Deploy a lightweight API server with FastAPI to serve semantic search requests.
  • Async endpoints and scalable vector stores improve performance for production use.
  • Validate environment variables and embedding dimensions to avoid common errors.
Verified 2026-04 · text-embedding-3-small, text-embedding-3-large