text-embedding-3-small vs text-embedding-3-large: embedding model comparison
VERDICT
Use text-embedding-3-small when cost, storage, or throughput is the constraint and its accuracy is acceptable, which covers most standard use cases. Use text-embedding-3-large when you need maximum semantic accuracy for complex retrieval-augmented generation or specialized domain tasks.
Side-by-side comparison
| Feature | text-embedding-3-small | text-embedding-3-large | Winner |
|---|---|---|---|
| Embedding dimensions (default) | 1536 (reducible via the dimensions parameter) | 3072 (reducible via the dimensions parameter) | Tie (domain-dependent) |
| Cost per 1M tokens | $0.02 | $0.13 | text-embedding-3-small |
| Inference latency (p50) | ~8ms | ~12ms | text-embedding-3-small |
| Vector DB storage (per 1M float32 vectors, default dimensions) | ~6GB | ~12GB | text-embedding-3-small |
| MTEB score (average) | ~62.3 | ~64.6 | text-embedding-3-large |
| Max context length | 8,191 tokens | 8,191 tokens | Tie |
| Quality on standard benchmarks | ~96% of large's MTEB score | Baseline (100%) | text-embedding-3-large |
| Throughput (batch=32) | ~4,000 vectors/sec | ~1,500 vectors/sec | text-embedding-3-small |
Performance benchmarks
MTEB score (benchmark average)
The roughly 2-point difference is a benchmark average and does not map directly to a recall figure on your own corpus. Small remains competitive for most real-world RAG applications.
Cost per 1M tokens (as of April 2026)
6.5x cost difference. For 10B tokens/month, small costs $200 vs $1,300 for large. Compounds significantly at scale.
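The arithmetic behind these figures can be checked with a short sketch. The prices are the per-1M-token rates quoted in this comparison; the `PRICE_PER_1M` table and the `embedding_cost` function are illustrative names, not part of any SDK.

```python
# Per-1M-token prices as quoted in this comparison (assumption: verify current pricing).
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost(model: str, tokens: int) -> float:
    """Return the USD cost of embedding `tokens` tokens with `model`."""
    return tokens / 1_000_000 * PRICE_PER_1M[model]


monthly_tokens = 10_000_000_000  # 10B tokens/month, the volume used in the text
small = embedding_cost("text-embedding-3-small", monthly_tokens)
large = embedding_cost("text-embedding-3-large", monthly_tokens)
print(f"small: ${small:,.0f}  large: ${large:,.0f}  ratio: {large / small:.1f}x")
```

Swapping in your own monthly token volume gives a quick estimate before committing to either model.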
Vector storage (per 1M embedded documents)
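At default dimensions (1536 vs 3072), large requires 2x the vector DB storage and bandwidth; the gap reaches 6x only when small is shortened to 512 dimensions. This affects Pinecone, Weaviate, and self-hosted vector retrieval costs.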
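Raw vector storage is simple to estimate: number of vectors × dimensions × bytes per value. The sketch below assumes float32 values (4 bytes each) and ignores index structures and metadata, which add overhead that varies by vector DB. The function name `raw_vector_storage_gb` is illustrative.

```python
def raw_vector_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Size of the raw vectors alone, in GB (10^9 bytes).

    Assumes float32 storage by default; index and metadata overhead is not modeled.
    """
    return num_vectors * dims * bytes_per_value / 1e9


# 512 = a shortened small vector, 1536 = small's default, 3072 = large's default
for dims in (512, 1536, 3072):
    size = raw_vector_storage_gb(1_000_000, dims)
    print(f"{dims} dims: {size:.2f} GB per 1M vectors")
```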
Inference latency (single embedding request)
Batch processing reduces per-token latency significantly for both. Latency difference less relevant in async RAG pipelines.
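Batching here usually means sending many inputs per request instead of one. A minimal sketch, assuming a caller-supplied `embed_batch` function (hypothetical; in practice it could wrap `client.embeddings.create` and return one vector per input) and an arbitrary batch size:

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 128,
) -> List[List[float]]:
    """Embed `texts` in batches, preserving input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in embedder so the sketch runs offline: one fake 1-dim vector per text.
def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


print(embed_all(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

Injecting the embedder keeps the batching logic testable without network access; the right batch size depends on the provider's per-request limits, which are not stated in this comparison.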
When to use each
text-embedding-3-small
- ✓ Standard RAG applications (documents + Q&A): small wins on cost-to-quality ratio. Benchmark on your data, but expect only a small quality gap vs large.
- ✓ High-throughput retrieval: shorter vectors make indexing and nearest-neighbor search cheaper per query. Use for real-time search on millions of documents.
- ✓ Cost-sensitive deployments: embedding 10B tokens/month? Small saves $1,100/month vs large. Reinvest savings in better retrieval logic or reranking.
- ✓ Vector DB storage constraints: small uses half the space at default dimensions (1536 vs 3072). Critical for self-hosted deployments or edge scenarios with storage limits.
- ✓ Downstream training on top of embeddings: neither model can be fine-tuned directly through the API, but classifiers or adapter layers trained on small's shorter vectors train faster and use less memory.
text-embedding-3-large
- ✓ Specialized domain retrieval: legal documents, medical abstracts, or scientific papers where a couple of points of recall matter. Verify with your benchmark first.
- ✓ Semantic similarity at scale: if you're doing extensive clustering or similarity comparisons where nuance is monetizable, large's richer representation pays for itself.
- ✓ Multi-lingual or technical embeddings: larger dimensionality helps capture code, mixed-language content, and domain terminology with fewer false positives.
- ✓ When cost is truly unconstrained: AI research teams, enterprise search where retrieval accuracy directly impacts revenue, or mission-critical RAG.
- ✓ Vector space visualization or interpretation: 3072-dim embeddings retain more structure for t-SNE/UMAP visualization and interpretability research.
Common misconceptions
text-embedding-3-small
Misconception: text-embedding-3-small is a 'lite' or 'fast' version that sacrifices quality, so it's only for prototypes.
Reality: small is served through the same API as large and returns 1536-dim vectors by default. On MTEB it scores roughly 96-97% of large's average. It is suitable for production once you have benchmarked it on your own data.
Misconception: smaller embeddings mean worse rare-word or out-of-vocabulary handling.
Reality: both models use the same tokenizer with a roughly 100K-token vocabulary, so rare words are split into subword tokens the same way. Dimensionality affects the granularity of the vector space, not how out-of-vocabulary text is tokenized.
Misconception: you need to store small's embeddings in a different vector DB or use a different distance metric.
Reality: both output unit-normalized vectors compatible with cosine, L2, and dot-product distance. Pinecone, Weaviate, and Milvus handle either size the same way; the only setting that changes is the index's configured vector dimension.
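The reason the distance metric does not matter for normalized vectors can be shown numerically: for unit-length vectors, cosine similarity equals the dot product, and squared L2 distance equals 2 − 2 × dot product, so all three produce the same ranking. The sketch below uses random 8-dim stand-in vectors rather than real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for embeddings, scaled to unit length like the API's output.
doc_vecs = rng.normal(size=(5, 8))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = rng.normal(size=8)
query_vec /= np.linalg.norm(query_vec)

dot = doc_vecs @ query_vec
cosine = dot / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
l2 = np.linalg.norm(doc_vecs - query_vec, axis=1)

# Higher dot/cosine is better; lower L2 distance is better.
rank_by_dot = np.argsort(-dot)
rank_by_l2 = np.argsort(l2)

print("cosine == dot:", np.allclose(cosine, dot))
print("l2^2 == 2 - 2*dot:", np.allclose(l2 ** 2, 2 - 2 * dot))
print("same ranking:", np.array_equal(rank_by_dot, rank_by_l2))
```

This equivalence holds only while the vectors stay unit-length; vectors that are truncated or otherwise modified need to be re-normalized first.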
text-embedding-3-large
Misconception: text-embedding-3-large is always better, so you should use it by default.
Reality: a roughly 2-point MTEB improvement is a benchmark average, not a guarantee of better recall on your corpus. On your data, small may match large depending on your query/document distribution. Always benchmark before committing to large's 6.5x cost.
Misconception: more dimensions = better embeddings, period.
Reality: 3072 dimensions add storage and computational cost that pays off only if your retrieval task needs that resolution. For short queries against short documents (e.g., product search), a smaller vector may be sufficient. Benchmark on your specific task.
Misconception: you can switch from small to large seamlessly if accuracy drops.
Reality: switching requires re-embedding your entire corpus and rebuilding the vector index, because vectors from the two models are not comparable. For 100M documents averaging 1,000 tokens each (an assumed size), that is 100B tokens × $0.13 per 1M tokens = $13,000. Test on a sample corpus first, not in production.
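On the dimensionality point above: the text-embedding-3 models accept a `dimensions` request parameter that returns shorter vectors, which is one way to test whether a task needs the full resolution. The sketch below shows the local equivalent (keep the leading values, then re-normalize) on a random stand-in vector; `shorten_embedding` is an illustrative helper, not an SDK function.

```python
import numpy as np


def shorten_embedding(vector, dims: int) -> np.ndarray:
    """Keep the first `dims` values and rescale the result to unit length."""
    truncated = np.asarray(vector, dtype=np.float64)[:dims]
    norm = np.linalg.norm(truncated)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return truncated / norm


# Random stand-in for a 3072-dim embedding (not real model output).
full = np.random.default_rng(0).normal(size=3072)
short = shorten_embedding(full, 256)
print(short.shape, round(float(np.linalg.norm(short)), 6))
```

Re-normalizing matters because dot-product and cosine scores are only interchangeable for unit-length vectors. Whether a shortened vector is accurate enough is an empirical question for your own benchmark.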
Code examples
Task: Embed a text query and retrieve the 3 most similar documents from a pre-embedded corpus using OpenAI's embedding API.
```python
import os

import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Documents to embed (in production these vectors live in a vector DB)
docs = [
    "Python is a programming language",
    "The quick brown fox jumps over the lazy dog",
    "Machine learning models require training data",
]

# Embed documents with text-embedding-3-small (1536 dims by default)
doc_embeddings = client.embeddings.create(
    model="text-embedding-3-small",  # $0.02/1M tokens
    input=docs,
).data

# Embed the query with the same model as the documents
query = "What is Python?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query],
).data[0].embedding

# The API returns unit-length vectors, so the dot product equals cosine similarity
similarities = [
    np.dot(query_embedding, np.array(doc_emb.embedding))
    for doc_emb in doc_embeddings
]

# Indices of the 3 highest scores, best first
top_3_indices = np.argsort(similarities)[-3:][::-1]
for idx in top_3_indices:
    print(f"Doc: {docs[idx]}, Score: {similarities[idx]:.4f}")
```
text-embedding-3-small produces 1536-dim vectors by default at $0.02/1M tokens. The query is embedded with the same model as the documents, which is what makes the similarity scores comparable for RAG and search applications.
```python
import os

import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Documents to embed (in production these vectors live in a vector DB)
docs = [
    "Python is a programming language",
    "The quick brown fox jumps over the lazy dog",
    "Machine learning models require training data",
]

# Embed documents with text-embedding-3-large (3072 dims by default)
doc_embeddings = client.embeddings.create(
    model="text-embedding-3-large",  # $0.13/1M tokens
    input=docs,
).data

# Embed the query with the same model as the documents
query = "What is Python?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input=[query],
).data[0].embedding

# The API returns unit-length vectors, so the dot product equals cosine similarity
similarities = [
    np.dot(query_embedding, np.array(doc_emb.embedding))
    for doc_emb in doc_embeddings
]

# Indices of the 3 highest scores, best first
top_3_indices = np.argsort(similarities)[-3:][::-1]
for idx in top_3_indices:
    print(f"Doc: {docs[idx]}, Score: {similarities[idx]:.4f}")
```
text-embedding-3-large produces 3072-dim vectors by default at $0.13/1M tokens (6.5x costlier). The API signature is identical to small; only the model name and resulting dimensionality differ. Vector DB retrieval logic is unchanged apart from the index's configured dimension.
Migration path
Switching between text-embedding-3-small and text-embedding-3-large requires two steps:
- Change the model parameter from 'text-embedding-3-small' to 'text-embedding-3-large' in your client.embeddings.create() call.
- Re-embed your entire corpus and rebuild your vector DB with the index dimension set to the new vector size. For 100M docs averaging 1,000 tokens each (an assumed size), that is 100B tokens × $0.13/1M = $13,000, which is $11,000 more than embedding the same corpus with small ($2,000).
The API calls are otherwise compatible: same client, same distance metrics in Pinecone/Weaviate/Milvus. Before switching, benchmark both models on a 10k-document sample of your domain to verify that the roughly 2-point MTEB difference justifies the 6.5x cost on your specific task.
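A sample benchmark of that kind can be sketched as a recall@k comparison. The sketch assumes you already have query and document vectors from each model plus, for every query, the index of its one known-relevant document; `recall_at_k` is an illustrative helper and the toy vectors below stand in for real embeddings.

```python
import numpy as np


def recall_at_k(query_vecs, doc_vecs, relevant, k: int = 10) -> float:
    """Fraction of queries whose relevant document appears in the top-k results.

    Ranks documents by dot product, which equals cosine similarity for
    unit-length vectors. `relevant[i]` is the document index for query i.
    """
    scores = np.asarray(query_vecs) @ np.asarray(doc_vecs).T
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [rel in row for rel, row in zip(relevant, top_k)]
    return float(np.mean(hits))


# Toy data: each query vector points exactly at one document vector.
toy_docs = np.eye(3)
toy_queries = np.eye(3)
print(recall_at_k(toy_queries, toy_docs, relevant=[0, 1, 2], k=1))
```

Running the same function on vectors from each model, with the same queries and relevance labels, gives a like-for-like number to weigh against the cost difference.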
RECOMMENDATION