How are embeddings used in RAG?
In RAG, embeddings convert documents and queries into dense vector representations that capture semantic meaning. These vectors enable fast similarity search to retrieve relevant documents, which the LLM then uses to generate accurate, context-aware answers. RAG is like giving an AI a textbook to look things up in rather than asking it to recall everything from memory — retrieval finds the pages, then the LLM reads and answers.
The core mechanism
Embeddings transform text into fixed-length vectors that represent semantic meaning in a high-dimensional space. In RAG, both the knowledge base documents and user queries are embedded. The system then performs a vector similarity search to find documents closest to the query vector. These retrieved documents provide relevant context for the LLM to generate informed responses, effectively combining retrieval and generation.
For example, a 1536-dimensional embedding vector might represent a paragraph, and cosine similarity measures how close two vectors are, indicating semantic relevance.
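Cosine similarity can be made concrete with a short sketch. The vectors below are tiny hand-made toy values standing in for real 1536-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" standing in for real 1536-d vectors
rainbow_doc = [0.9, 0.1, 0.3]
cooking_doc = [0.1, 0.9, 0.0]
query = [0.85, 0.15, 0.35]

print(cosine_similarity(query, rainbow_doc))  # high: semantically close
print(cosine_similarity(query, cooking_doc))  # much lower: unrelated topic
```

A higher cosine score means the two vectors point in a similar direction, which the retriever interprets as semantic relevance.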
Step by step
- User inputs a question: "What causes rainbows?"
- The question is converted into an embedding vector.
- The system searches the document embeddings index for the top 3 closest vectors.
- The corresponding documents about light refraction and rainbows are retrieved.
- The LLM receives the question plus retrieved documents as context.
- The LLM generates a detailed answer using both its knowledge and the retrieved info.
| Step | Action | Example Output |
|---|---|---|
| 1 | Embed query | [0.12, -0.34, ..., 0.56] (1536-d vector) |
| 2 | Search index | Top 3 docs with cosine similarity > 0.85 |
| 3 | Retrieve docs | "Light refraction causes rainbows..." |
| 4 | Generate answer | "Rainbows form when sunlight refracts..." |
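The retrieval steps in the table can be sketched end to end with a small in-memory index. This is a minimal illustration with toy vectors; a real system would embed each document with a model and use a vector store such as FAISS:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy index: document text paired with a hand-made "embedding"
index = [
    ("Light refraction causes rainbows when sunlight hits droplets.", [0.9, 0.1, 0.2]),
    ("Photosynthesis converts sunlight into chemical energy.", [0.2, 0.9, 0.1]),
    ("Raindrops disperse light into a spectrum of colors.", [0.8, 0.2, 0.3]),
]

def top_k(query_vector, k=2):
    """Rank documents by cosine similarity to the query and return the k best."""
    scored = [(cosine(query_vector, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return [text for score, text in scored[:k]]

query_vector = [0.85, 0.15, 0.25]  # stands in for the embedded query
for doc in top_k(query_vector):
    print(doc)
```

The two rainbow-related documents outrank the photosynthesis one, which is exactly what the retrieval step in the table is doing at scale.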
Concrete example
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Embed the query (text-embedding-3-small returns 1536-d vectors)
query = "What causes rainbows?"
embedding_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
)
query_vector = embedding_response.data[0].embedding

# Step 2: Assume we have a FAISS index of document embeddings.
# Here we simulate a search returning the top 2 docs (normally use FAISS or similar).
top_docs = [
    "Rainbows are caused by light refraction and dispersion in water droplets.",
    "Sunlight bends when passing through raindrops, creating a spectrum of colors."
]

# Step 3: Use retrieved docs as context for the LLM
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"Question: {query}\nContext: {top_docs[0]} {top_docs[1]}"}
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)
print(response.choices[0].message.content)
```

Example output: "Rainbows form when sunlight passes through raindrops, bending and splitting into a spectrum of colors due to refraction and dispersion."
Common misconceptions
People often think RAG means the LLM "remembers" all facts internally, but actually it relies on external documents retrieved via embeddings. Another misconception is that embeddings are just keywords; in reality, they capture deep semantic meaning, enabling retrieval of relevant info even if exact words differ.
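The keyword misconception is easy to demonstrate: two paraphrases can share zero words, so a pure keyword-overlap score rates them as irrelevant, while an embedding model (not run here) would place them close together in vector space. A quick check with a crude Jaccard overlap:

```python
def keyword_overlap(a, b):
    """Jaccard overlap of lowercase word sets — a crude keyword-matching score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

query = "What causes rainbows?"
doc = "Sunlight bends inside water droplets, splitting into a spectrum."

# No shared words at all, so keyword matching scores this as irrelevant,
# even though the document clearly answers the question.
print(keyword_overlap(query, doc))  # 0.0
```

An embedding-based retriever would still surface this document for the query, because the vectors encode meaning rather than surface word forms.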
Why it matters for building AI apps
Using embeddings in RAG allows developers to build AI systems that scale knowledge dynamically without retraining the LLM. It enables up-to-date, accurate answers by searching large document collections efficiently. This approach reduces hallucinations and improves user trust in AI applications.
Key Takeaways
- Embeddings convert text into vectors that capture semantic meaning for similarity search.
- RAG uses embeddings to retrieve relevant documents that provide context for LLM generation.
- This retrieval step improves accuracy and scalability by grounding AI responses in external knowledge.