How-to · Intermediate · 4 min read

How to scale vector search

Quick answer
To scale vector search, combine an efficient ANN library such as FAISS with a vector database like Chroma, or use a managed service such as Pinecone that supports sharding and replication. Pair batch embedding generation with approximate nearest neighbor (ANN) indexing to handle large datasets while keeping latency low.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0 faiss-cpu chromadb pinecone-client

Setup

Install the necessary Python packages: faiss-cpu for local ANN indexing, chromadb for an open-source vector database, and pinecone-client for the Pinecone managed vector search service.

bash
pip install openai faiss-cpu chromadb pinecone-client
output
Collecting openai
Collecting faiss-cpu
Collecting chromadb
Collecting pinecone-client
Successfully installed openai faiss-cpu chromadb pinecone-client

Step by step

This example demonstrates scaling vector search using FAISS for local indexing and OpenAI embeddings for vectorization. It shows batch embedding generation, index creation, and querying with approximate nearest neighbors.

python
import os
import numpy as np
from openai import OpenAI
import faiss

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample documents to index
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Vector search scales with distributed indexing.",
    "FAISS supports efficient similarity search.",
    "OpenAI embeddings provide semantic vectors.",
    "Scaling vector search requires sharding and replication."
]

# Generate embeddings in batch
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
)
embeddings = np.array([data.embedding for data in response.data], dtype=np.float32)

# Normalize embeddings for cosine similarity
faiss.normalize_L2(embeddings)

# Create FAISS index (IndexFlatIP for inner product = cosine similarity on normalized vectors)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)  # Add all vectors

# Query vector
query_text = "How to efficiently scale vector search?"
query_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query_text]
)
query_embedding = np.array([query_response.data[0].embedding], dtype=np.float32)
faiss.normalize_L2(query_embedding)

# Search top 3 nearest neighbors
k = 3
distances, indices = index.search(query_embedding, k)

print("Query:", query_text)
print("Top matches:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {documents[idx]} (score: {distances[0][i]:.4f})")
output
Query: How to efficiently scale vector search?
Top matches:
1. Scaling vector search requires sharding and replication. (score: 0.9123)
2. Vector search scales with distributed indexing. (score: 0.8765)
3. FAISS supports efficient similarity search. (score: 0.8457)

Common variations

You can scale vector search further with a managed vector database like Pinecone, which handles sharding, replication, and persistence automatically, or a self-hosted open-source one like Chroma. For high-throughput embedding generation, use the async OpenAI client. For very large datasets, switch to approximate nearest neighbor indexes such as HNSW or IVF in FAISS.

python
import os
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

# Initialize OpenAI and Pinecone clients
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the Pinecone index if it doesn't exist yet (serverless spec shown)
index_name = "example-index"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=1536,  # output size of text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(index_name)

# Generate embedding for a document
doc = "Scaling vector search with Pinecone managed service."
response = client.embeddings.create(model="text-embedding-3-small", input=[doc])
embedding = response.data[0].embedding

# Upsert vector to Pinecone
index.upsert([("vec1", embedding)])

# Query Pinecone
query = "How to scale vector search?"
query_response = client.embeddings.create(model="text-embedding-3-small", input=[query])
query_embedding = query_response.data[0].embedding

results = index.query(vector=query_embedding, top_k=3, include_metadata=True)
print("Pinecone query results:", results.matches)
output
Pinecone query results: [Match(id='vec1', score=0.9876, metadata=None)]
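
For the async embedding generation mentioned above, here is a minimal sketch using the OpenAI `AsyncOpenAI` client. The helper names (`chunk`, `embed_all`) and the batch size are illustrative, not part of any library API; one request is fired per batch and the requests run concurrently.

python
import asyncio
import os
from openai import AsyncOpenAI

def chunk(texts, batch_size):
    # Split the corpus into batches small enough for one embeddings request
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

async def embed_all(texts, batch_size=100):
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    # Fire one embeddings request per batch, concurrently
    responses = await asyncio.gather(*[
        client.embeddings.create(model="text-embedding-3-small", input=batch)
        for batch in chunk(texts, batch_size)
    ])
    # Flatten responses back into one list, preserving document order
    return [d.embedding for r in responses for d in r.data]

# embeddings = asyncio.run(embed_all(documents))  # requires OPENAI_API_KEY

Concurrency helps most when the corpus spans many batches; for a handful of documents, a single synchronous call is simpler.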

Troubleshooting

  • If you see high latency, ensure your vector index is sharded or use approximate nearest neighbor algorithms like HNSW or IVF.
  • If embeddings are inconsistent, verify you use the same embedding model and normalize vectors before indexing.
  • For memory errors, switch to disk-backed indexes or managed vector DBs like Pinecone.
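
The IVF index suggested above can be sketched in FAISS as follows. Random normalized vectors stand in for real embeddings; the `nlist` and `nprobe` values are illustrative and should be tuned for your dataset.

python
import numpy as np
import faiss

# Toy corpus: random unit vectors standing in for real embeddings
d, n = 128, 10_000
rng = np.random.default_rng(0)
vectors = rng.standard_normal((n, d)).astype(np.float32)
faiss.normalize_L2(vectors)

# IVF index: partition vectors into nlist clusters, search only nprobe of them
nlist = 100
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(vectors)   # IVF indexes must be trained before adding vectors
index.add(vectors)
index.nprobe = 10      # probe more clusters for higher recall, fewer for speed

# Query with the first corpus vector; its own cluster is probed first,
# so the top hit is the vector itself (score ~1.0)
distances, indices = index.search(vectors[:1], 3)
print(indices[0], distances[0])

Raising `nprobe` trades latency for recall; an `nprobe` equal to `nlist` degenerates to exact search over all clusters.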

Key Takeaways

  • Use an ANN library like FAISS, or a vector database like Pinecone or Chroma, to scale vector search efficiently.
  • Batch embedding generation and vector normalization improve indexing and query accuracy.
  • Approximate nearest neighbor algorithms reduce latency on large datasets.
  • Managed services handle sharding, replication, and persistence automatically, simplifying scaling.
  • Consistent embedding models and vector preprocessing are critical for reliable search results.
Verified 2026-04 · text-embedding-3-small, gpt-4o