Code intermediate · 4 min read

How to build a RAG system with LlamaIndex

Direct answer
Use LlamaIndex to build a vector index over your documents, then query it through a query engine backed by an LLM such as gpt-4o: the engine retrieves the most relevant context and generates an answer grounded in it.

Setup

Install
bash
pip install llama-index
Env vars
OPENAI_API_KEY
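The key can be exported in your shell before running any of the snippets below; the `sk-...` value is a placeholder for your real key:

```shell
# Placeholder value; substitute your real OpenAI API key
export OPENAI_API_KEY="sk-..."
```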
Imports
python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Examples

Query: 'What is the capital of France?'
Answer: 'The capital of France is Paris.'

Query: 'Explain the benefits of renewable energy.'
Answer: 'Renewable energy reduces greenhouse gas emissions, lowers energy costs, and promotes sustainability.'

Query: 'Who wrote the novel 1984?'
Answer: 'The novel 1984 was written by George Orwell.'

Integration steps

  1. Install LlamaIndex (pip install llama-index) and set OPENAI_API_KEY in your environment.
  2. Load your documents with LlamaIndex's SimpleDirectoryReader or another loader.
  3. Build a vector index from the documents with VectorStoreIndex.from_documents.
  4. Configure gpt-4o as the LLM via Settings.llm.
  5. Create a query engine and run your question through it; retrieval and generation happen in one call.
  6. Print or use the generated response in your application.

Full code

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# Configure gpt-4o as the LLM (the API key is read from OPENAI_API_KEY)
Settings.llm = OpenAI(model='gpt-4o')

# Load documents from a local directory
documents = SimpleDirectoryReader('data').load_data()

# Build the vector index (chunks and embeds the documents)
index = VectorStoreIndex.from_documents(documents)

# Create a query engine and ask a question; it retrieves relevant
# context and generates the answer in one call
query_engine = index.as_query_engine()
response = query_engine.query('What is the capital of France?')

print('Answer:', response)

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "<retrieved context + user query>"}]}
Response
json
{"choices": [{"message": {"content": "The capital of France is Paris."}}], "usage": {"total_tokens": 50}}
Extract: response.choices[0].message.content
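The extraction step can be sketched as a small helper; `extract_answer` is a name introduced here for illustration, and the dict mirrors the JSON trace above:

```python
# Hypothetical helper: pull the answer text out of a raw chat-completion
# response shaped like the trace above.
def extract_answer(response: dict) -> str:
    return response['choices'][0]['message']['content']

raw = {
    'choices': [{'message': {'content': 'The capital of France is Paris.'}}],
    'usage': {'total_tokens': 50},
}
print(extract_answer(raw))  # → The capital of France is Paris.
```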

Variants

Streaming RAG Query

Use streaming to provide partial answers in real-time for better user experience with long responses.

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model='gpt-4o')

documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

# Enable streaming on the query engine
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query('Explain renewable energy benefits.')

# Print tokens as they arrive
for token in response.response_gen:
    print(token, end='')
Async RAG Query

Use async calls when integrating RAG in applications requiring concurrency or non-blocking behavior.

python
import asyncio

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

async def main():
    Settings.llm = OpenAI(model='gpt-4o')
    documents = SimpleDirectoryReader('data').load_data()
    index = VectorStoreIndex.from_documents(documents)

    # aquery awaits retrieval and generation without blocking the event loop
    query_engine = index.as_query_engine()
    response = await query_engine.aquery('Who wrote 1984?')
    print('Answer:', response)

asyncio.run(main())
Alternative Model: Claude 3.5 Sonnet

Use Claude 3.5 Sonnet for higher coding accuracy or alternative LLM preferences.

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic  # pip install llama-index-llms-anthropic

# Reads ANTHROPIC_API_KEY from the environment; embeddings still default
# to OpenAI unless you also change Settings.embed_model
Settings.llm = Anthropic(model='claude-3-5-sonnet-20241022', max_tokens=512)

documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query('What is the capital of France?')

print('Answer:', response)

Performance

Latency: ~800ms for a gpt-4o non-streaming query
Cost: ~$0.002 per 500 tokens exchanged with gpt-4o
Rate limits: Tier 1: 500 requests per minute / 30,000 tokens per minute
  • Use vector indexes to limit context size and reduce tokens sent to the LLM.
  • Cache frequent queries to avoid repeated API calls.
  • Summarize or chunk documents before indexing to optimize token usage.
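The caching bullet above can be sketched with functools.lru_cache; `make_cached_query` is an illustrative name, and any callable mapping a question string to an answer string works as `query_fn`:

```python
from functools import lru_cache

# Hypothetical wrapper: memoize answers so a repeated question never
# triggers a second retrieval + LLM round trip.
def make_cached_query(query_fn, maxsize=128):
    @lru_cache(maxsize=maxsize)
    def cached(question: str) -> str:
        return query_fn(question)
    return cached

# Usage with a query engine built as above:
#   ask = make_cached_query(lambda q: str(query_engine.query(q)))
#   ask('What is the capital of France?')  # calls the API
#   ask('What is the capital of France?')  # served from cache
```

Note this only helps for exact-match repeats; semantically similar queries still miss the cache.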
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Basic RAG with GPT-4o | ~800ms | ~$0.002 | General-purpose retrieval and generation |
| Streaming RAG | Starts in ~300ms, streams over time | ~$0.002 | Long answers with better UX |
| Async RAG | ~800ms (concurrent) | ~$0.002 | Concurrent or high-throughput apps |
| Claude 3.5 Sonnet | ~700ms | ~$0.0025 | Higher coding accuracy and nuanced responses |
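The cost figures above can be sanity-checked with simple arithmetic; `estimate_cost` is a throwaway helper assuming the ~$0.002 per 500 tokens rate quoted in this section:

```python
# Back-of-envelope estimate assuming ~$0.002 per 500 tokens (see above);
# real pricing differs per model and between input and output tokens.
def estimate_cost(total_tokens: int, usd_per_500_tokens: float = 0.002) -> float:
    return total_tokens / 500 * usd_per_500_tokens

print(round(estimate_cost(50), 6))  # the 50-token trace above → 0.0002
```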

Quick tip

Build the index once and persist it with index.storage_context.persist() so later runs skip re-embedding your documents, cutting both token usage and query latency.

Common mistake

Beginners often send the question straight to the chat API instead of through index.as_query_engine(), so no retrieved context reaches the model and the answer ignores the indexed documents.

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022