What embeddings are: text as vectors
Why this matters
Embeddings are the foundation for semantic search, recommendation systems, and retrieval-augmented generation (RAG). Without understanding how text becomes vectors, you can't build production search or clustering features that actually understand meaning rather than just keyword matching.
Explanation
The OpenAI Embeddings API takes text (a word, sentence, or document) and returns a vector of floating-point numbers that represents that text's meaning in a mathematical space. By default, the text-embedding-3-small model produces 1536-dimensional vectors and text-embedding-3-large produces 3072-dimensional vectors. These numbers aren't random: they're learned from billions of text examples so that semantically similar texts end up close to each other in that vector space.
Under the hood, the SDK sends your text to OpenAI's servers, where a neural embedding model encodes it. The model has learned that "dog" and "puppy" should have similar vectors, while "dog" and "refrigerator" should be far apart. You can measure this closeness with cosine similarity or Euclidean distance. This is why embeddings power semantic search: you embed your documents once ahead of time, embed each user query as it arrives, and return the documents whose vectors are closest to the query vector.
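To make "close in vector space" concrete, here is a minimal sketch of cosine similarity and a nearest-first ranking in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings (which have 1536 or more dimensions), and the function and variable names are illustrative rather than part of any SDK.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT real model output.
documents = {
    "puppy": [0.8, 0.2, 0.1],
    "refrigerator": [0.0, 0.1, 0.9],
}
query = [0.9, 0.1, 0.0]  # stand-in for the embedding of "dog"

# Rank document names from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same comparison, typically with an approximate index so it does not have to score every stored vector.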
Use embeddings when you need semantic understanding: search, clustering, duplicate detection, or feeding context into an LLM. Embed data once and store vectors in a vector database (Pinecone, Weaviate, Postgres with pgvector). Never embed the same text twice: it's wasteful and expensive. For a typical production system, you embed during data ingestion, then compare at query time.
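One way to enforce the "embed once" rule is to key stored vectors by a hash of the model name and the text. The sketch below is an assumption-laden illustration: `EmbeddingCache` and `embed_fn` are hypothetical names, the in-memory dict stands in for a vector database, and `embed_fn` would wrap the real API call in practice.

```python
import hashlib
from typing import Callable

Vector = list[float]


class EmbeddingCache:
    """Embeds each distinct (model, text) pair at most once."""

    def __init__(self, embed_fn: Callable[[str], Vector], model: str) -> None:
        self._embed_fn = embed_fn  # hypothetical wrapper around the API call
        self._model = model
        self._store: dict[str, Vector] = {}  # stand-in for a vector database

    def _key(self, text: str) -> str:
        # Include the model name so vectors from different models never collide.
        payload = f"{self._model}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text: str) -> Vector:
        key = self._key(text)
        if key not in self._store:
            # Cache miss: this is the only place that pays for an embedding.
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

During ingestion you would call `get` for every document; a repeated text returns the stored vector without another paid request.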
Request code
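```python
from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Request an embedding for a single piece of text.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog",
)

# One embedding object comes back per input; a single string yields one.
print(f"Embedding dimension: {len(response.data[0].embedding)}")
print(f"First 10 values: {response.data[0].embedding[:10]}")
print(f"Total tokens used: {response.usage.total_tokens}")
```
Authentication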
The OpenAI SDK automatically reads your API key from the OPENAI_API_KEY environment variable when you instantiate the client. Ensure your key has access to the Embeddings API (it should by default). Set the key in your shell before running Python: `export OPENAI_API_KEY='sk-...'` or pass it directly: `client = OpenAI(api_key='sk-...')`.
Response shape
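| Field | Description |
|---|---|
| `object` | `list` (the response type identifier) |
| `data` | List of embedding objects, one per input, each with `object`, `embedding`, and `index` |
| `model` | `text-embedding-3-small` |
| `usage` | Object with `prompt_tokens` and `total_tokens` |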
Field guide
`embedding`: The vector itself, a list of floats. Store this in your vector database. This is your semantic fingerprint of the text.
`index`: Position of this embedding in your input batch: 0 if you sent a single text, 1 for the second item in a batch, and so on. Critical when batching: it tells you which embedding corresponds to which input.
`model`: Confirms which embedding model was used. Always verify this matches what you requested.
`usage.prompt_tokens`: The number of tokens your input text consumed. OpenAI charges per token, so track this. A typical sentence uses 5-15 tokens.
Setup trap
If you set `os.environ['OPENAI_API_KEY']` after already instantiating `OpenAI()`, the client won't pick it up. The SDK reads the environment variable once, at client initialization, and the constructor raises an error if it finds no key at all. Initialize the client *after* your environment is set up, or pass the key directly to the constructor.
Cost
As of April 2026, text-embedding-3-small costs $0.02 per 1M input tokens. text-embedding-3-large costs $0.13 per 1M input tokens. A typical document of 500 words is ~750 tokens. Embedding 1 million documents with the small model costs ~$15. Cache your embeddings: don't re-embed the same text.
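The arithmetic behind that estimate can be sketched as a small helper. The prices are the per-million-token figures quoted above, and the function name and the 750-tokens-per-document average are illustrative assumptions, so substitute your own measurements.

```python
# Prices in USD per 1M input tokens, as quoted in the text above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def estimate_cost_usd(num_documents: int, tokens_per_document: int, model: str) -> float:
    """Estimate the one-time cost of embedding a corpus."""
    total_tokens = num_documents * tokens_per_document
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# 1 million documents at roughly 750 tokens each.
small = estimate_cost_usd(1_000_000, 750, "text-embedding-3-small")
large = estimate_cost_usd(1_000_000, 750, "text-embedding-3-large")
print(f"small: ${small:.2f}, large: ${large:.2f}")
```

Under the same assumptions the large model comes to about $97.50, which is why the choice of model matters more than the per-request overhead.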
Rate limits
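Rate limits depend on your account's usage tier and are enforced on both requests per minute and tokens per minute. The 3,000 requests per minute often quoted for embeddings endpoints is not universal, and higher tiers get higher limits, so check the limits shown for your own account. If you're embedding millions of documents, use batch processing or space requests over time. The API handles batch `input` lists well: send up to 2,048 texts per request to reduce API calls.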
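Splitting a large corpus into request-sized batches might look like the sketch below. The `chunked` helper is a hypothetical name, not an SDK function, and the 2,048 default is the per-request input count mentioned above; each yielded list would be passed as the `input` argument of one embeddings request.

```python
from typing import Iterator, Sequence

MAX_INPUTS_PER_REQUEST = 2048  # per-request input count quoted in the text


def chunked(texts: Sequence[str], size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


texts = [f"document {i}" for i in range(5000)]
batches = list(chunked(texts))
print([len(batch) for batch in batches])
# Each batch would then go out as one request, for example:
#   client.embeddings.create(model="text-embedding-3-small", input=batch)
```

Fewer, larger requests reduce the request count, but every token in a batch still counts toward token-based limits, so pacing between batches may still be needed.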
Common gotcha
Developers often treat embeddings as deterministic within a version, but they're not guaranteed to be bitwise identical across API calls: tiny floating-point differences are normal and won't break cosine similarity. The real gotcha: batching. If you send `input=["text1", "text2", "text3"]`, you get one response with three embeddings in the `data` array. It's easy to index wrong and assign the wrong vector to the wrong text. Always verify `response.data[i].index == i`.
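A slightly more defensive variant of that check is to place each vector by its `index` field instead of trusting list order. In this sketch, `pair_texts_with_embeddings` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the items of `response.data` so the example runs without an API call.

```python
from types import SimpleNamespace
from typing import Any, Sequence


def pair_texts_with_embeddings(
    texts: Sequence[str], data: Sequence[Any]
) -> list[tuple[str, list[float]]]:
    """Match each input text to its vector using the item's `index` field."""
    if len(data) != len(texts):
        raise ValueError(f"expected {len(texts)} embeddings, got {len(data)}")
    vectors: list[Any] = [None] * len(texts)
    for item in data:
        if not 0 <= item.index < len(texts):
            raise ValueError(f"index {item.index} is out of range")
        if vectors[item.index] is not None:
            raise ValueError(f"duplicate index {item.index}")
        vectors[item.index] = list(item.embedding)
    return list(zip(texts, vectors))


# Stand-in for response.data, deliberately shuffled to exercise the logic.
fake_data = [
    SimpleNamespace(index=1, embedding=[0.2, 0.2]),
    SimpleNamespace(index=0, embedding=[0.1, 0.1]),
    SimpleNamespace(index=2, embedding=[0.3, 0.3]),
]
pairs = pair_texts_with_embeddings(["text1", "text2", "text3"], fake_data)
```

With a real response you would pass `response.data` as the second argument; the helper fails loudly on a count mismatch or a repeated index rather than silently storing the wrong vector.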
Error recovery
AuthenticationError
RateLimitError
APIConnectionError
InvalidRequestError with 'model' (named `BadRequestError` in version 1 and later of the Python SDK)
Experienced dev note
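The biggest mistake senior devs make: storing embeddings without tracking the model version. Six months later, you have 10 million embeddings from text-embedding-3-small, your company switches to a different embedding model for better quality, and now you have to re-embed everything, because vectors from different models live in different spaces and can't be compared. Store the model name and embedding creation timestamp with every vector so you know exactly which rows to migrate. Also, cosine similarity ranges from -1 to 1 in theory, but scores for real-world text usually fall in a much narrower band that differs from model to model, so calibrate any similarity threshold on your own data rather than hard-coding one. OpenAI's embeddings are normalized to unit length, which means cosine similarity reduces to a plain dot product and Euclidean distance produces the same ranking as cosine similarity. That equivalence holds only for unit-length vectors, so re-normalize if you truncate or otherwise modify them.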
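Tracking the model alongside each vector can be as simple as the record sketched below. `EmbeddingRecord`, `text_id`, and `needs_reembedding` are hypothetical names chosen for illustration; in a real system these would be columns or metadata fields in your vector store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EmbeddingRecord:
    """One stored vector plus the metadata needed for a later migration."""

    text_id: str
    vector: tuple[float, ...]
    model: str  # which embedding model produced this vector
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def needs_reembedding(record: EmbeddingRecord, current_model: str) -> bool:
    """True when the vector came from a different model than the one in use."""
    return record.model != current_model


record = EmbeddingRecord(
    text_id="sku-1",
    vector=(0.1, 0.2, 0.3),
    model="text-embedding-3-small",
)
print(needs_reembedding(record, "text-embedding-3-large"))
```

With the model stored per row, a migration becomes a filtered re-embedding job instead of a guess about which vectors are stale.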
Check your understanding
You have 100,000 product descriptions you want to make searchable by meaning. Should you embed each description once during data ingestion, or re-embed the descriptions every time a search query arrives? Why?
Answer hint
Think about cost and latency. The API charges per token every time you call it. If you embed the descriptions during ingestion and store the vectors, each search costs only one short query embedding plus vector math. If you re-embed the descriptions on every query, you pay for all 100,000 of them on every single search.