How to · Intermediate · 3 min read

Azure OpenAI RAG architecture best practices

Quick answer
Use Azure OpenAI with a vector store such as Azure Cognitive Search (now Azure AI Search) or Pinecone to implement Retrieval-Augmented Generation (RAG). Optimize by chunking documents, embedding with text-embedding-3-large, caching embeddings, and designing concise prompts that combine retrieved context with the user's query.

PREREQUISITES

  • Python 3.9+ (the example uses built-in generic type hints such as list[float])
  • Azure OpenAI API key
  • Azure Cognitive Search or vector DB access
  • pip install openai>=1.0 azure-search-documents

Setup

Install the required Python packages and set environment variables for Azure OpenAI and Azure Cognitive Search. Ensure you have an Azure OpenAI resource and a vector search service configured.

bash
pip install openai azure-search-documents
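The example code reads its configuration from environment variables. A minimal setup might look like the following; every value is a placeholder for your own resource endpoints, keys, and deployment names:

```shell
export AZURE_OPENAI_API_KEY="<your-azure-openai-key>"
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
export AZURE_OPENAI_DEPLOYMENT="<your-chat-deployment>"   # e.g. a gpt-4o deployment
export AZURE_SEARCH_ENDPOINT="https://<your-search-service>.search.windows.net"
export AZURE_SEARCH_INDEX="<your-index-name>"
export AZURE_SEARCH_API_KEY="<your-search-admin-key>"
```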

Step by step

This example demonstrates a simple RAG pipeline using Azure OpenAI for generation and Azure Cognitive Search for vector retrieval. It embeds documents, queries the vector store, and sends context with the user query to the chat model.

python
import os
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Initialize Azure OpenAI client
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01"
)

# Initialize Azure Cognitive Search client
search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name=os.environ["AZURE_SEARCH_INDEX"],
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_API_KEY"])
)

# Embed text using Azure OpenAI embeddings.
# Note: with Azure OpenAI, `model` must be the name of your embedding
# *deployment*; here we assume a deployment named after the model itself.

def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-large",  # your embedding deployment name
        input=text
    )
    return response.data[0].embedding

# Query vector search with the embedded user query.
# Note: requires azure-search-documents >= 11.4 (GA vector search API).
from azure.search.documents.models import VectorizedQuery

user_query = "Explain the benefits of RAG architecture"
query_embedding = embed_text(user_query)

results = search_client.search(
    search_text=None,  # pure vector search; no keyword query
    vector_queries=[VectorizedQuery(
        vector=query_embedding,
        k_nearest_neighbors=3,
        fields="embeddingVector",  # must match your index's vector field name
    )],
    top=3,
)

# Aggregate retrieved documents ("content" must match your index's text field)
context = "\n\n".join(doc["content"] for doc in results)

# Compose prompt with context and user query
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"}
]

# Call Azure OpenAI chat completion
response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    messages=messages
)

print(response.choices[0].message.content)
output
Azure OpenAI RAG architecture improves accuracy by combining retrieval of relevant documents with generation, reducing hallucinations and enabling up-to-date responses.

Common variations

  • Use async calls with asyncio for higher throughput.
  • Swap Azure Cognitive Search with other vector DBs like Pinecone or Weaviate.
  • Experiment with different embedding models such as text-embedding-3-small for faster embedding.
  • Implement caching for embeddings and retrieved documents to reduce latency and cost.
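The caching variation can be sketched with a small in-memory memo table. This is an illustrative helper, not part of any Azure SDK: it keys the cache on a hash of the input text so repeated queries skip the embeddings API entirely.

```python
import hashlib

# Illustrative in-memory cache; swap for Redis or disk storage in production.
_embedding_cache: dict[str, list[float]] = {}

def cached_embed(text: str, embed_fn) -> list[float]:
    """Return a cached embedding, calling embed_fn only on a cache miss."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]
```

Pass `embed_text` from the pipeline above as `embed_fn`; the wrapper stays agnostic to which embedding backend you use.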

Troubleshooting

  • If retrieval returns no results, verify your vector index is populated and the embedding dimension matches the model.
  • For prompt length errors, chunk documents smaller or summarize context before sending.
  • If you get authentication errors, confirm environment variables and API keys are correct and have proper permissions.
  • Monitor Azure OpenAI usage quotas and scale your Cognitive Search tier accordingly.
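For the prompt-length errors above, a simple remedy is smaller chunks with a little overlap, so sentences split across chunk boundaries still retrieve well. Here is a minimal word-based sketch; production systems typically chunk by tokens (e.g. with tiktoken) rather than words:

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most max_words words."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Tune `max_words` so that `top`-k retrieved chunks plus the question comfortably fit your model's context window.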

Key Takeaways

  • Use vector search with Azure Cognitive Search or similar to retrieve relevant context for RAG.
  • Embed queries and documents with Azure OpenAI embedding models like text-embedding-3-large.
  • Design prompts that combine retrieved context and user queries for accurate generation.
  • Cache embeddings and retrieval results to optimize latency and cost.
  • Chunk or summarize documents to fit within token limits and avoid prompt truncation.
Verified 2026-04 · gpt-4o, text-embedding-3-large