How to create a vector store with the OpenAI API
Quick answer
To create a vector store with the
OpenAI API, first generate embeddings using the client.embeddings.create method with a model like text-embedding-3-large. Then store these vectors in a vector database such as FAISS or Chroma for efficient similarity search and retrieval.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install faiss-cpu or chromadb
Setup
Install the openai Python SDK and a vector store library like faiss-cpu or chromadb. Set your OpenAI API key as an environment variable.
- Install OpenAI SDK: pip install openai
- Install FAISS (CPU version): pip install faiss-cpu
- Or install ChromaDB: pip install chromadb
- Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
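Before running the examples, you can quickly confirm the key is visible to Python in your current session (a minimal check; the masked printout is just to avoid echoing the full key):

```python
import os

# Read the key from the environment; None means it is not set in this session
api_key = os.environ.get("OPENAI_API_KEY")

if api_key:
    # Avoid printing the full key; show only a masked prefix
    print(f"Key found: {api_key[:6]}...")
else:
    print("OPENAI_API_KEY is not set; export it before running the examples.")
```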
Step by step
This example shows how to generate embeddings with the OpenAI API and store them in a FAISS vector store for similarity search.
import os
from openai import OpenAI
import faiss
import numpy as np
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Sample documents to embed
documents = [
    "OpenAI develops advanced AI models.",
    "Vector stores enable fast similarity search.",
    "FAISS is a popular vector database library."
]
# Generate embeddings
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=documents
)
embeddings = [e.embedding for e in response.data]
# Convert to numpy array
embedding_matrix = np.array(embeddings).astype('float32')
# Create FAISS index
dimension = embedding_matrix.shape[1]
index = faiss.IndexFlatL2(dimension) # L2 distance
index.add(embedding_matrix)
# Query example
query = "AI models and vector search"
query_response = client.embeddings.create(
    model="text-embedding-3-large",
    input=[query]
)
query_embedding = np.array(query_response.data[0].embedding).astype('float32')
# Search top 2 similar documents
k = 2
D, I = index.search(np.array([query_embedding]), k)
print("Top matches:")
for i, dist in zip(I[0], D[0]):
    print(f"Document: {documents[i]} (Distance: {dist:.4f})")

Output

Top matches:
Document: OpenAI develops advanced AI models. (Distance: 0.1234)
Document: Vector stores enable fast similarity search. (Distance: 0.2345)
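The example above ranks by L2 distance, but OpenAI's text-embedding-3 vectors are commonly compared with cosine similarity instead. With FAISS, that means L2-normalizing the vectors (faiss.normalize_L2) and using an inner-product index (faiss.IndexFlatIP); the normalization step itself is plain NumPy. A minimal sketch, with toy 2-D vectors standing in for the real embedding_matrix and query embedding:

```python
import numpy as np

# Toy vectors standing in for real embeddings (assumption: in practice,
# use the embedding_matrix and query_embedding from the example above)
embedding_matrix = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]], dtype="float32")
query = np.array([1.0, 0.0], dtype="float32")

# L2-normalize rows so the dot product equals cosine similarity
norms = np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
normalized = embedding_matrix / norms
query_normalized = query / np.linalg.norm(query)

# Cosine similarity of the query against every document vector
similarities = normalized @ query_normalized
top = np.argsort(-similarities)[:2]  # indices of the 2 most similar rows
print(top, similarities[top])
```

The FAISS equivalent is to call faiss.normalize_L2 on both the matrix and the query and then search a faiss.IndexFlatIP index, where larger scores mean more similar.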
Common variations
You can use other vector stores such as ChromaDB or Pinecone for scalable, cloud-hosted storage. You can also generate embeddings asynchronously, or use a different embedding model such as text-embedding-3-small (cheaper, with smaller vectors). For large datasets, batch your embedding requests.
import os
from openai import OpenAI
import chromadb
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Initialize Chroma client and collection
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")
documents = ["Example document 1", "Example document 2"]
# Generate embeddings
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=documents
)
embeddings = [e.embedding for e in response.data]
# Add to Chroma collection
collection.add(documents=documents, embeddings=embeddings, ids=["doc1", "doc2"])
# Query
query = "search example"
query_response = client.embeddings.create(
    model="text-embedding-3-large",
    input=[query]
)
query_embedding = query_response.data[0].embedding
results = collection.query(query_embeddings=[query_embedding], n_results=2)
print(results)

Output
Returns matching documents with distances from ChromaDB
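The "batch your embedding requests" advice above can be sketched as a simple chunking loop. This is a sketch, not the official SDK's batching API: chunked and embed_in_batches are hypothetical helpers, client is an openai.OpenAI instance as in the earlier examples, and the batch_size of 100 is illustrative:

```python
def chunked(texts, batch_size):
    """Yield successive batch_size-sized slices of texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

def embed_in_batches(client, texts, batch_size=100, model="text-embedding-3-small"):
    """Embed texts in chunks to stay under request-size and rate limits.

    `client` is an openai.OpenAI instance (see the examples above).
    """
    embeddings = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # response.data preserves the input order within each batch
        embeddings.extend(e.embedding for e in response.data)
    return embeddings
```

Called as embed_in_batches(client, documents), this returns one embedding per input text, in order, while keeping each API request to at most batch_size inputs.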
Troubleshooting
- API key errors: Ensure OPENAI_API_KEY is set correctly in your environment.
- Embedding model errors: Use a valid embedding model like text-embedding-3-large.
- FAISS installation issues: On some platforms, install FAISS via conda, or use faiss-cpu for CPU-only support.
- Rate limits: Batch embedding requests to avoid hitting API rate limits.
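A common way to handle the rate-limit point above is retrying with exponential backoff. A minimal generic sketch (with_retries is a hypothetical helper; in real code, catch openai.RateLimitError specifically rather than bare Exception):

```python
import time
import random

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Retry `call` (a zero-argument function) with exponential backoff.

    Assumption: `call` raises a retryable exception (e.g. the OpenAI SDK's
    openai.RateLimitError) when the API rate limit is hit.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # narrow this to openai.RateLimitError in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller see the error
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus noise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

Used as with_retries(lambda: client.embeddings.create(model="text-embedding-3-large", input=batch)), this spaces out retries so bursts of requests back off instead of failing immediately.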
Key Takeaways
- Use client.embeddings.create with a suitable embedding model to generate vectors.
- Store embeddings in a vector database like FAISS or Chroma for efficient similarity search.
- Batch embedding requests and handle API keys securely via environment variables.