Intermediate · 4 min read

How to build RAG with Azure OpenAI

Quick answer
Build Retrieval-Augmented Generation (RAG) on Azure OpenAI by combining a vector store for document retrieval with Azure OpenAI chat completions for generation. Use the AzureOpenAI client from the openai SDK to query your deployed model with the retrieved context and produce accurate, context-aware answers.

PREREQUISITES

  • Python 3.8+
  • Azure OpenAI resource with deployment name
  • Azure OpenAI API key and endpoint
  • pip install "openai>=1.0" (quoted so the shell does not treat >= as a redirect)
  • pip install faiss-cpu (or another vector store)

Setup

Install the required Python packages and set environment variables for your Azure OpenAI API key and endpoint. You will also need a vector store like FAISS for document retrieval.

bash
pip install openai faiss-cpu
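The code below reads credentials from the environment. The values here are placeholders; substitute your own key, resource endpoint, and chat deployment name:

```shell
# Placeholders — replace with the values from your Azure OpenAI resource
export AZURE_OPENAI_API_KEY="<your-api-key>"
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
export AZURE_OPENAI_DEPLOYMENT="<your-chat-deployment-name>"
```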

Step by step

This example shows how to build a simple RAG system by embedding documents, storing them in FAISS, retrieving relevant documents for a query, and then using Azure OpenAI to generate an answer based on the retrieved context.

python
import os
from openai import AzureOpenAI
import faiss
import numpy as np

# Initialize Azure OpenAI client
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01"
)

# Example documents to index
documents = [
    "Python is a popular programming language.",
    "Azure OpenAI provides powerful AI models.",
    "Retrieval-Augmented Generation combines search and generation.",
    "FAISS is a library for efficient similarity search."
]

# Step 1: Embed documents using Azure OpenAI embeddings
# In Azure OpenAI, `model` must be your embeddings *deployment* name
embedding_model = "text-embedding-3-large"

def embed_text(texts):
    response = client.embeddings.create(model=embedding_model, input=texts)
    return np.array([item.embedding for item in response.data], dtype=np.float32)

embeddings = embed_text(documents)

# Step 2: Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Step 3: Query embedding
query = "What is RAG?"
query_embedding = embed_text([query])

# Step 4: Retrieve top 2 relevant documents
k = 2
_, indices = index.search(query_embedding, k)
retrieved_docs = [documents[i] for i in indices[0]]

# Step 5: Build prompt with retrieved context
context = "\n".join(retrieved_docs)
prompt = f"Use the following context to answer the question:\n{context}\nQuestion: {query}\nAnswer:"

# Step 6: Generate answer with Azure OpenAI chat completion
response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content

print("Question:", query)
print("Answer:", answer)
output
Question: What is RAG?
Answer: Retrieval-Augmented Generation (RAG) is a technique that combines document retrieval with language model generation to provide accurate and context-aware answers.

Common variations

  • Use the AsyncAzureOpenAI client with asyncio and await to run requests concurrently in scalable applications.
  • Switch embedding models or chat models by changing embedding_model or model parameters.
  • Integrate other vector stores like Chroma or Pinecone instead of FAISS.

Troubleshooting

  • If you get authentication errors, verify your AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT environment variables.
  • If embeddings fail, ensure you have actually deployed the embedding model (e.g. text-embedding-3-large) in your resource and that you pass its deployment name as model.
  • For slow retrieval on large corpora, note that IndexFlatL2 searches exhaustively; consider an approximate FAISS index such as IndexIVFFlat.
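One more retrieval detail: IndexFlatL2 ranks by Euclidean distance, while many embedding models are tuned for cosine similarity, which you get by normalizing vectors first (in FAISS, pair normalization with IndexFlatIP). A minimal numpy sketch of the idea, using toy 2-D vectors in place of real embeddings:

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=2):
    # Normalize rows so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Indices of the k highest-scoring documents, best first
    return np.argsort(scores)[::-1][:k]

# Toy vectors standing in for real embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float32)
query = np.array([0.9, 0.1], dtype=np.float32)
print(cosine_top_k(query, docs))  # → [0 2]
```

Because the vectors are normalized, document length no longer affects ranking, only direction does.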

Key Takeaways

  • Use the AzureOpenAI client from the openai SDK, with credentials in environment variables, for secure API access.
  • Combine vector search (FAISS) with Azure OpenAI chat completions for effective RAG.
  • Embed documents and queries with the same embedding model for accurate retrieval.
  • Construct prompts by injecting retrieved context before the user query.
  • Test and adjust retrieval count and model parameters for best results.
Verified 2026-04 · gpt-4o, text-embedding-3-large