How to build advanced RAG with LlamaIndex
Quick answer
Use LlamaIndex to create an advanced Retrieval-Augmented Generation (RAG) pipeline by combining document ingestion, vector indexing, and a large language model such as gpt-4o. This involves loading documents, building a vector index with LlamaIndex, and querying it with context-aware prompts to generate accurate, context-rich responses.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install llama-index llama-index-vector-stores-faiss faiss-cpu
Setup
Install the required packages and set your environment variables for API keys.
pip install llama-index llama-index-vector-stores-faiss faiss-cpu
Step by step
This example demonstrates building an advanced RAG system using LlamaIndex with OpenAI's gpt-4o model and FAISS vector store for efficient retrieval.
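Before the full pipeline, it helps to see what the retrieval step boils down to: embed each document chunk as a vector, then rank chunks by similarity to the query vector. A toy, dependency-free sketch of that idea (the hand-made 3-dimensional vectors and sample chunk texts are stand-ins for real model embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three document chunks (real systems use model embeddings)
chunks = {
    "RAG reduces hallucinations": [0.9, 0.1, 0.0],
    "FAISS enables fast search": [0.1, 0.9, 0.1],
    "Indexes can be persisted": [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=1):
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)
    return ranked[:k]

print(retrieve([0.8, 0.2, 0.1]))  # → ['RAG reduces hallucinations']
```

The real pipeline replaces the toy vectors with OpenAI embeddings and the linear scan with a FAISS index, but the retrieval logic is the same.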
import os
import faiss
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.faiss import FaissVectorStore

# Set your OpenAI API key in the environment before running, e.g.:
#   export OPENAI_API_KEY="sk-..."
assert "OPENAI_API_KEY" in os.environ, "OPENAI_API_KEY is not set"

# Use gpt-4o for response synthesis (llama-index 0.10+ configures this via Settings)
Settings.llm = OpenAI(model="gpt-4o")

# Load documents from a local directory
documents = SimpleDirectoryReader("./docs").load_data()

# Build a vector store index backed by FAISS
# (1536 matches the dimensionality of OpenAI's default text-embedding-ada-002)
faiss_index = faiss.IndexFlatL2(1536)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Save the index to disk for persistence
index.storage_context.persist(persist_dir="./storage")

# Later, load the index from storage
vector_store = FaissVectorStore.from_persist_dir("./storage")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./storage"
)
index = load_index_from_storage(storage_context)

# Query the index with a question
query_engine = index.as_query_engine()
response = query_engine.query("Explain the main benefits of Retrieval-Augmented Generation.")
print("Response:", response.response)
Output
Response: Retrieval-Augmented Generation (RAG) enhances language models by integrating external knowledge through retrieval, improving accuracy, relevance, and reducing hallucinations.
Common variations
- Use asynchronous calls with asyncio for scalable querying.
- Switch to other vector stores like Chroma or Weaviate by adapting the storage context.
- Use a different LLM such as claude-3-5-sonnet-20241022 by changing Settings.llm.
import asyncio

async def async_query(query_engine, query_text):
    response = await query_engine.aquery(query_text)
    print("Async response:", response.response)

# Example usage
# asyncio.run(async_query(index.as_query_engine(), "What is RAG?"))
Output
Async response: Retrieval-Augmented Generation (RAG) combines retrieval of relevant documents with generation to improve response quality.
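A sketch of the Claude variation, assuming the llama-index-llms-anthropic integration package (the package and class names come from LlamaIndex's Anthropic integration, and ANTHROPIC_API_KEY must be set in the environment):

```python
# pip install llama-index-llms-anthropic
from llama_index.core import Settings
from llama_index.llms.anthropic import Anthropic

# Requires ANTHROPIC_API_KEY in the environment
Settings.llm = Anthropic(model="claude-3-5-sonnet-20241022")
# Rebuild or reload the index as before; queries now use Claude for synthesis
```

Because the LLM lives in Settings rather than in the index itself, the rest of the pipeline is unchanged.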
Troubleshooting
- If you see an API key missing error, ensure OPENAI_API_KEY is set in your environment.
- If vector index loading fails, verify the persist_dir path and that the index was saved correctly.
- For slow queries, consider reducing max_tokens or using a smaller model like gpt-4o-mini.
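For the slow-query case, a minimal sketch of capping output length and switching to gpt-4o-mini (max_tokens is a constructor parameter on LlamaIndex's OpenAI LLM wrapper; 256 is an arbitrary example cap):

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# A smaller model and a response-length cap cut both latency and cost
Settings.llm = OpenAI(model="gpt-4o-mini", max_tokens=256)
```

Queries issued after this point use the cheaper model without any other changes to the pipeline.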
Key Takeaways
- Use LlamaIndex with vector stores like FAISS to build scalable RAG systems.
- Persist and reload indexes to optimize retrieval performance and reduce costs.
- Swap LLMs via Settings and vector backends via StorageContext without changing the rest of the pipeline.
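Continuing the example above (this assumes the `index` built in the step-by-step section), retrieval depth and synthesis strategy are the two easiest knobs to tune when making a RAG system more "advanced":

```python
# Retrieve more chunks and summarize them hierarchically instead of stuffing
# everything into one prompt
query_engine = index.as_query_engine(
    similarity_top_k=5,              # number of chunks retrieved per query
    response_mode="tree_summarize",  # hierarchical summarization of retrieved chunks
)
response = query_engine.query("Summarize the key themes across all documents.")
print(response.response)
```

Raising similarity_top_k broadens the retrieved context at the cost of more tokens per query; tree_summarize trades extra LLM calls for answers that synthesize across many chunks.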