How are embeddings used in RAG?
In RAG, embeddings convert documents and queries into dense vector representations that capture semantic meaning. These vectors enable fast similarity search to retrieve relevant documents, which the LLM then uses to generate accurate, context-aware answers. RAG is like giving an AI a textbook to look things up in rather than asking it to recall everything from memory — retrieval finds the pages, then the LLM reads and answers.
The core mechanism
Embeddings transform text into fixed-length vectors that represent semantic meaning in a high-dimensional space. In RAG, both the knowledge base documents and user queries are embedded. The system then performs a vector similarity search to find documents closest to the query vector. These retrieved documents provide relevant context for the LLM to generate informed responses, effectively combining retrieval and generation.
For example, a 1536-dimensional embedding vector might represent a paragraph, and cosine similarity measures how close two vectors are, indicating semantic relevance.
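Cosine similarity can be made concrete with a short sketch. The vectors below are tiny hand-made toy values standing in for real 1536-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" standing in for real 1536-d vectors
rainbow_doc = [0.9, 0.1, 0.3]
cooking_doc = [0.1, 0.9, 0.0]
query = [0.85, 0.15, 0.35]

print(cosine_similarity(query, rainbow_doc))  # high: semantically close
print(cosine_similarity(query, cooking_doc))  # much lower: unrelated topic
```

A higher cosine score means the two vectors point in a similar direction, which the retriever interprets as semantic relevance.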
Step by step
- User inputs a question: "What causes rainbows?"
- The question is converted into an embedding vector.
- The system searches the document embeddings index for the top 3 closest vectors.
- The corresponding documents about light refraction and rainbows are retrieved.
- The LLM receives the question plus retrieved documents as context.
- The LLM generates a detailed answer using both its knowledge and the retrieved info.
| Step | Action | Example Output |
|---|---|---|
| 1 | Embed query | [0.12, -0.34, ..., 0.56] (1536-d vector) |
| 2 | Search index | Top 3 docs with cosine similarity > 0.85 |
| 3 | Retrieve docs | "Light refraction causes rainbows..." |
| 4 | Generate answer | "Rainbows form when sunlight refracts..." |
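The retrieval steps in the table can be sketched end to end with a small in-memory index. This is a minimal illustration with toy vectors; a real system would embed each document with a model and use a vector store such as FAISS:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy index: document text paired with a hand-made "embedding"
index = [
    ("Light refraction causes rainbows when sunlight hits droplets.", [0.9, 0.1, 0.2]),
    ("Photosynthesis converts sunlight into chemical energy.", [0.2, 0.9, 0.1]),
    ("Raindrops disperse light into a spectrum of colors.", [0.8, 0.2, 0.3]),
]

def top_k(query_vector, k=2):
    """Rank documents by cosine similarity to the query and return the k best."""
    scored = [(cosine(query_vector, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return [text for score, text in scored[:k]]

query_vector = [0.85, 0.15, 0.25]  # stands in for the embedded query
for doc in top_k(query_vector):
    print(doc)
```

The two rainbow-related documents outrank the photosynthesis one, which is exactly what the retrieval step in the table is doing at scale.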
Concrete example
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Embed the query (text-embedding-3-small returns 1536-d vectors)
query = "What causes rainbows?"
embedding_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
)
query_vector = embedding_response.data[0].embedding

# Step 2: Assume we have a FAISS index of document embeddings.
# Here we simulate a search returning the top 2 docs (normally use FAISS or similar).
top_docs = [
    "Rainbows are caused by light refraction and dispersion in water droplets.",
    "Sunlight bends when passing through raindrops, creating a spectrum of colors."
]

# Step 3: Use retrieved docs as context for the LLM
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"Question: {query}\nContext: {top_docs[0]} {top_docs[1]}"}
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)
print(response.choices[0].message.content)
```

Example output: "Rainbows form when sunlight passes through raindrops, bending and splitting into a spectrum of colors due to refraction and dispersion."
Common misconceptions
People often think RAG means the LLM "remembers" all facts internally, but actually it relies on external documents retrieved via embeddings. Another misconception is that embeddings are just keywords; in reality, they capture deep semantic meaning, enabling retrieval of relevant info even if exact words differ.
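The keyword misconception is easy to demonstrate: two paraphrases can share zero words, so a pure keyword-overlap score rates them as irrelevant, while an embedding model (not run here) would place them close together in vector space. A quick check with a crude Jaccard overlap:

```python
def keyword_overlap(a, b):
    """Jaccard overlap of lowercase word sets — a crude keyword-matching score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

query = "What causes rainbows?"
doc = "Sunlight bends inside water droplets, splitting into a spectrum."

# No shared words at all, so keyword matching scores this as irrelevant,
# even though the document clearly answers the question.
print(keyword_overlap(query, doc))  # 0.0
```

An embedding-based retriever would still surface this document for the query, because the vectors encode meaning rather than surface word forms.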
Why it matters for building AI apps
Using embeddings in RAG allows developers to build AI systems that scale knowledge dynamically without retraining the LLM. It enables up-to-date, accurate answers by searching large document collections efficiently. This approach reduces hallucinations and improves user trust in AI applications.
Key Takeaways
- Embeddings convert text into vectors that capture semantic meaning for similarity search.
- RAG uses embeddings to retrieve relevant documents that provide context for LLM generation.
- This retrieval step improves accuracy and scalability by grounding AI responses in external knowledge.