Intermediate · 4 min read

How to deploy LlamaIndex RAG app

Quick answer
Use LlamaIndex to build a Retrieval-Augmented Generation (RAG) app by indexing your documents and querying them with an LLM such as gpt-4o. Load your data with a reader, build a vector index, then run queries through a query engine to generate context-aware responses.

PREREQUISITES

  • Python 3.9+
  • OpenAI API key (free tier works)
  • pip install llama-index openai

Setup

Install the required packages and set your OpenAI API key as an environment variable.

bash
pip install llama-index openai

# Set your API key in your shell environment
export OPENAI_API_KEY="your-api-key"

Step by step

This example shows how to create a simple LlamaIndex RAG app that loads documents, builds an index, and queries it using OpenAI's gpt-4o model.

python
from llama_index.core import SimpleDirectoryReader, Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

# Configure the LLM LlamaIndex uses for response synthesis;
# the OpenAI wrapper reads OPENAI_API_KEY from the environment.
Settings.llm = OpenAI(model="gpt-4o")

# Load documents from a local directory
documents = SimpleDirectoryReader("./data").load_data()

# Build the vector index (chunks, embeds, and stores the documents)
index = VectorStoreIndex.from_documents(documents)

# Query the index through a query engine
query_engine = index.as_query_engine()
response = query_engine.query("What are the key points in the documents?")

print("Response:", response.response)
output
Response: The key points in the documents are ...
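Under the hood, the query engine embeds your question and ranks the stored chunks by vector similarity before handing the top matches to the LLM. A toy, stdlib-only sketch of that retrieval step, using hypothetical 3-dimensional embeddings (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for three document chunks
chunks = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.1, 0.8, 0.1],
    "chunk_c": [0.2, 0.2, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # hypothetical embedding of the question

# Rank chunks by similarity to the query, as a vector index does
ranked = sorted(chunks, key=lambda c: cosine(chunks[c], query_vec), reverse=True)
print("Most relevant chunk:", ranked[0])  # chunk_a
```

LlamaIndex does this ranking for you and stuffs the top-k chunks into the LLM prompt; the sketch only shows why similar documents come back first.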

Common variations

  • Use async queries (aquery) to overlap I/O for better throughput.
  • Switch to other LLMs like gpt-4o-mini for cost efficiency.
  • Use different data loaders for PDFs or web pages.
python
import asyncio

from llama_index.core import SimpleDirectoryReader, Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

async def async_rag_query():
    # gpt-4o-mini trades a little quality for lower cost and latency
    Settings.llm = OpenAI(model="gpt-4o-mini")
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()
    # aquery is the async counterpart of query
    response = await query_engine.aquery("Summarize the documents.")
    print("Async response:", response.response)

asyncio.run(async_rag_query())
output
Async response: The documents summarize as ...

Troubleshooting

  • If you get authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
  • If document loading fails, check the path and file formats supported by SimpleDirectoryReader.
  • For slow responses, consider using smaller models like gpt-4o-mini or batching queries.
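For the first point, here is a minimal, stdlib-only sanity check you can run before starting the app; the `check_api_key` helper is illustrative, not part of LlamaIndex:

```python
import os

def check_api_key(env) -> bool:
    """Return True if a non-empty OPENAI_API_KEY is present in env."""
    return bool(env.get("OPENAI_API_KEY"))

if check_api_key(os.environ):
    print("OPENAI_API_KEY is set.")
else:
    print("OPENAI_API_KEY is missing; run `export OPENAI_API_KEY=...` first.")
```

Catching a missing key up front gives a clearer message than the authentication error the OpenAI client would raise mid-query.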

Key takeaways

  • Use LlamaIndex with OpenAI's gpt-4o model to build efficient RAG apps.
  • Load your documents with SimpleDirectoryReader and create a vector index for retrieval.
  • Query the index synchronously or asynchronously depending on your app's needs.
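To actually serve the app over HTTP, one minimal stdlib-only sketch wraps the query call in a small JSON endpoint. `run_query` below is a hypothetical stand-in you would replace with `query_engine.query(...)` from the examples above:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_query(question: str) -> str:
    # Stand-in for the real LlamaIndex call, e.g.:
    #   str(query_engine.query(question))
    return f"Echo: {question}"

class RagHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        answer = run_query(payload.get("query", ""))
        body = json.dumps({"response": answer}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep request logging quiet

def serve(port: int = 0) -> HTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), RagHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In production you would more likely reach for FastAPI or similar, but the shape is the same: accept a query, run it through the engine, return the response as JSON.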
Verified 2026-04 · gpt-4o, gpt-4o-mini