How-to · Intermediate · 3 min read

How to choose the best embedding model for RAG

Quick answer
Choose an embedding model for RAG by balancing semantic accuracy, vector dimensionality, cost, and latency. Use a model like text-embedding-3-large for high-quality semantic search, or a lighter model such as text-embedding-3-small for faster, cheaper retrieval, depending on your dataset size and query complexity.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to access embedding models.

bash
pip install "openai>=1.0"

Step by step

This example shows how to generate embeddings with the text-embedding-3-large model for RAG vector search.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

texts = [
    "OpenAI develops advanced AI models.",
    "Retrieval-Augmented Generation improves LLM responses.",
    "Embeddings convert text into vectors for similarity search."
]

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts
)

embeddings = [e.embedding for e in response.data]
for i, emb in enumerate(embeddings):
    print(f"Embedding vector {i} length: {len(emb)}")
output
Embedding vector 0 length: 3072
Embedding vector 1 length: 3072
Embedding vector 2 length: 3072
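
Once documents are embedded, retrieval reduces to nearest-neighbor search over the vectors. A minimal sketch of cosine-similarity ranking, using small toy vectors in place of real API output:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real embeddings
doc_vectors = {
    "rag": [0.1, 0.9, 0.2, 0.0],
    "embeddings": [0.3, 0.6, 0.4, 0.2],
    "weather": [0.9, 0.0, 0.1, 0.7],
}
query = [0.15, 0.85, 0.25, 0.05]

# Rank documents by similarity to the query vector
ranked = sorted(doc_vectors,
                key=lambda k: cosine_similarity(query, doc_vectors[k]),
                reverse=True)
print(ranked[0])  # → rag
```

In production the same ranking is typically delegated to a vector store rather than computed in Python, but the principle is identical.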

Common variations

You can choose a smaller model like text-embedding-3-small for faster, cheaper embeddings with lower dimensionality (1536 by default; both text-embedding-3 models also accept a dimensions parameter to shorten vectors further). For large-scale RAG, consider batch processing and async calls to optimize throughput.

python
import asyncio
import os
from openai import AsyncOpenAI

# AsyncOpenAI exposes the same interface as OpenAI, with awaitable methods
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def create_embeddings_async(texts):
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [e.embedding for e in response.data]

texts = ["Fast embeddings for RAG.", "Async calls improve throughput."]
embeddings = asyncio.run(create_embeddings_async(texts))
print(f"Received {len(embeddings)} embeddings asynchronously.")
output
Received 2 embeddings asynchronously.
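
Batching the corpus before each embeddings call keeps request sizes predictable. A minimal chunking helper (the batch size of 100 here is an illustrative choice, not an API limit):

```python
def batched(items, batch_size=100):
    # Yield successive fixed-size chunks from a list
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"document {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batched(texts)]
print(batch_sizes)  # → [100, 100, 50]
```

Each chunk can then be passed as the input list to client.embeddings.create, sequentially or concurrently via asyncio.gather.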

Troubleshooting

If embeddings have inconsistent lengths or requests fail, verify your model name and API key. Also ensure input text is not empty and does not exceed the model's token limit (limits vary by model). For latency issues, switch to a smaller embedding model or batch requests.
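
A lightweight pre-flight check catches most of these issues before the request is sent. A sketch, where the 8000-character cap is an illustrative stand-in for the model's actual token limit:

```python
def validate_inputs(texts, max_chars=8000):
    # Drop empty strings (which the API rejects) and truncate oversized inputs
    cleaned = []
    for text in texts:
        text = text.strip()
        if text:
            cleaned.append(text[:max_chars])
    return cleaned

safe = validate_inputs(["   ", "Retrieval tips.", "x" * 9000])
print(len(safe), len(safe[1]))  # → 2 8000
```

In production you would count tokens rather than characters, for example with the tiktoken library, since token limits are what the API actually enforces.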

Key Takeaways

  • Use higher-dimensional models like text-embedding-3-large for the best semantic accuracy in RAG.
  • Smaller models reduce cost and latency but may sacrifice retrieval quality.
  • Batch and async embedding calls improve performance on large datasets.
  • Validate input text length and model compatibility to avoid errors.
  • Match embedding dimensionality with your vector store capabilities for optimal search.
Verified 2026-04 · text-embedding-3-large, text-embedding-3-small