How to · Intermediate · 4 min read

How to build advanced RAG with LlamaIndex

Quick answer
Use LlamaIndex to create an advanced Retrieval-Augmented Generation (RAG) pipeline by combining document ingestion, vector indexing, and a large language model like gpt-4o. This involves loading documents, building a vector index with LlamaIndex, and querying it with context-aware prompts to generate accurate, context-rich responses.
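At its core, the retrieval step embeds the query, compares it against embedded document chunks, and keeps the top matches to feed into the prompt. A minimal, library-free sketch of that idea (toy 3-dimensional vectors and hypothetical helper names, not the LlamaIndex API):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_top_k(query_vec, chunks, k=2):
    # chunks: list of (text, embedding) pairs; return the k closest texts.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy "embeddings" just to show the mechanics.
chunks = [
    ("RAG reduces hallucinations.", [0.9, 0.1, 0.0]),
    ("FAISS enables fast search.", [0.1, 0.9, 0.0]),
    ("Unrelated cooking tips.", [0.0, 0.1, 0.9]),
]
print(retrieve_top_k([0.8, 0.2, 0.0], chunks, k=1))
# → ['RAG reduces hallucinations.']
```

In the real pipeline below, the toy vectors become OpenAI embeddings and the linear scan becomes a FAISS index.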

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install llama-index llama-index-vector-stores-faiss openai faiss-cpu

Setup

Install the required packages and set your environment variables for API keys.

bash
pip install llama-index llama-index-vector-stores-faiss openai faiss-cpu
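LlamaIndex picks up your key from the `OPENAI_API_KEY` environment variable. A placeholder export (substitute your real key for the dummy value):

```shell
# Placeholder value shown for illustration; use your own key.
export OPENAI_API_KEY="sk-your-key-here"
echo "OPENAI_API_KEY is ${OPENAI_API_KEY:+set}"
```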

Step by step

This example demonstrates building an advanced RAG system using LlamaIndex (v0.10+ import paths), OpenAI's gpt-4o model, and a FAISS vector store for efficient retrieval.

python
import faiss
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.faiss import FaissVectorStore

# Set your OpenAI API key before running, e.g.:
# export OPENAI_API_KEY="sk-..."

# Configure the LLM globally (Settings replaces the older ServiceContext)
Settings.llm = OpenAI(model="gpt-4o")

# Load documents from a local directory
documents = SimpleDirectoryReader("./docs").load_data()

# Build a FAISS vector store; 1536 matches the dimension of
# OpenAI's default text-embedding-ada-002 embeddings
faiss_index = faiss.IndexFlatL2(1536)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the vector index over the documents
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Save the index to disk for persistence
index.storage_context.persist(persist_dir="./storage")

# Later, load the index back from storage
vector_store = FaissVectorStore.from_persist_dir("./storage")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./storage"
)
index = load_index_from_storage(storage_context)

# Query the index through a query engine
query_engine = index.as_query_engine()
response = query_engine.query(
    "Explain the main benefits of Retrieval-Augmented Generation."
)

print("Response:", response.response)
output
Response: Retrieval-Augmented Generation (RAG) enhances language models by integrating external knowledge through retrieval, improving accuracy, relevance, and reducing hallucinations.
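Persistence matters because embedding documents costs API calls: once the index is written to `./storage`, it can be reloaded without re-embedding anything. A toy, library-free illustration of the save/reload round trip (hypothetical file name, not LlamaIndex's actual storage format):

```python
import json
import os
import tempfile

# Pretend these are chunk embeddings we paid API calls to compute.
embeddings = {"chunk-1": [0.9, 0.1], "chunk-2": [0.1, 0.9]}

path = os.path.join(tempfile.gettempdir(), "toy_index.json")
with open(path, "w") as f:
    json.dump(embeddings, f)      # "persist" step

with open(path) as f:
    reloaded = json.load(f)       # "load" step: no recomputation needed

print(reloaded == embeddings)     # → True
```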

Common variations

  • Use asynchronous calls with asyncio for scalable querying.
  • Switch to other vector stores like Chroma or Weaviate by adapting the storage context.
  • Use different LLMs such as claude-3-5-sonnet-20241022 by configuring a different llm (e.g. LlamaIndex's Anthropic integration).
python
import asyncio

async def async_query(index, query_text):
    # Async querying goes through the query engine's aquery method
    response = await index.as_query_engine().aquery(query_text)
    print("Async response:", response.response)

# Example usage
# asyncio.run(async_query(index, "What is RAG?"))
output
Async response: Retrieval-Augmented Generation (RAG) combines retrieval of relevant documents with generation to improve response quality.

Troubleshooting

  • If you see API key missing, ensure OPENAI_API_KEY is set in your environment.
  • If vector index loading fails, verify the persist_dir path and that the index was saved correctly.
  • For slow queries, consider reducing max_tokens or using a smaller model like gpt-4o-mini.
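For the first bullet, a small guard at the top of your script fails fast with a readable message instead of a deep stack trace later (a convention of this guide, not something LlamaIndex requires):

```python
import os

def require_api_key(name="OPENAI_API_KEY"):
    # Raise early if the key is missing so the error is obvious.
    if not os.environ.get(name):
        raise RuntimeError(f"{name} is not set; export it before running.")
    return True
```

Call `require_api_key()` before building the index.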

Key Takeaways

  • Use LlamaIndex with vector stores like FAISS to build scalable RAG systems.
  • Persist and reload indexes to optimize retrieval performance and reduce costs.
  • Swap LLMs or vector backends easily by changing the configured llm and storage context.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022