How to build an agent with LlamaIndex
Direct answer
Use LlamaIndex to load and index your documents, then create an agent by combining the index with a language model such as OpenAI's GPT-4o to answer queries interactively.

Setup
Install
pip install llama-index

Env vars
OPENAI_API_KEY

Imports
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Examples
in: What is the main topic of the documents?
out: The documents primarily discuss renewable energy technologies and their impact on climate change.
in: Summarize the key points from the indexed files.
out: The key points include solar and wind energy benefits, challenges in adoption, and recent policy developments.
in: Who authored the documents and when?
out: The documents were authored by the Environmental Research Group in 2025.
Integration steps
- Install LlamaIndex and set OPENAI_API_KEY in your environment variables.
- Load your documents using LlamaIndex's SimpleDirectoryReader or another loader.
- Configure OpenAI's GPT-4o model as the default LLM via LlamaIndex's Settings.
- Build a VectorStoreIndex from the loaded documents.
- Create a query engine from the index and ask natural language questions to get AI-generated answers.
- Print or process the returned responses from the agent.
Full code
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# Configure the default LLM; the client reads OPENAI_API_KEY from the environment
Settings.llm = OpenAI(model="gpt-4o")

# Load documents from a directory
documents = SimpleDirectoryReader("data").load_data()

# Build the vector index
index = VectorStoreIndex.from_documents(documents)

# Create a query engine and ask a question
query_engine = index.as_query_engine()
query = "What are the main topics covered in the documents?"
response = query_engine.query(query)
print("Agent response:", response.response)

API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "What are the main topics covered in the documents?"}]}
Response
{"choices": [{"message": {"content": "The documents cover topics related to renewable energy, including solar and wind power technologies, their benefits, challenges, and policy implications."}}]}
Extract
response.choices[0].message.content
(This trace shows the underlying OpenAI call; in practice the query engine also injects retrieved document context into the prompt, and you read the final answer from response.response.)

Variants
Streaming Agent Response
Use streaming when you want to display partial responses in real time for a better user experience.
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o")
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Enable streaming on the query engine and print tokens as they arrive
query_engine = index.as_query_engine(streaming=True)
print("Agent response: ", end="")
response = query_engine.query("Summarize the documents.")
response.print_response_stream()

Async Agent Query
Use async when integrating the agent into an async web server or other concurrent environment.
import os
import asyncio
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o")
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

async def async_query(query):
    response = await query_engine.aquery(query)
    print("Agent async response:", response.response)

asyncio.run(async_query("What is the summary of the documents?"))

Alternative Model: Using GPT-4o-mini
Use GPT-4o-mini for faster, lower-cost queries when high precision is less critical.
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini")
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("Explain the main ideas in the documents.")
print("Agent response:", response.response)

Performance
Latency: ~800 ms for GPT-4o non-streaming queries
Cost: ~$0.002 per 500 tokens with GPT-4o
Rate limits: Tier 1: 500 requests per minute / 30,000 tokens per minute
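A quick back-of-the-envelope check based on the ~$0.002 per 500 tokens figure above (the rate is illustrative, taken from this section, not an official price list):

```python
def estimate_cost(total_tokens: int, usd_per_500_tokens: float = 0.002) -> float:
    """Rough query cost: total tokens scaled by the per-500-token rate."""
    return (total_tokens / 500) * usd_per_500_tokens

# A query whose prompt plus completion totals 1,500 tokens:
print(round(estimate_cost(1_500), 4))  # → 0.006
```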
- Use concise prompts to reduce token usage.
- Cache index results to avoid repeated calls.
- Use smaller models like gpt-4o-mini for less critical queries.
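The caching advice above can be sketched as a simple in-memory wrapper around the query call. This is a minimal sketch, not a LlamaIndex class: `engine` stands in for any object exposing a `query(question)` method, such as a query engine.

```python
class CachedQueryEngine:
    """Memoize answers so repeated questions skip the expensive LLM call."""

    def __init__(self, engine):
        self.engine = engine  # any object with a .query(question) method
        self.cache = {}       # question -> answer

    def query(self, question: str):
        if question not in self.cache:
            self.cache[question] = self.engine.query(question)
        return self.cache[question]
```

Wrap your real engine once and identical questions cost nothing on repeat. Persisting the index itself to disk, rather than rebuilding embeddings on every run, is a complementary optimization.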
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard GPT-4o Agent | ~800ms | ~$0.002 | High-quality, general-purpose queries |
| Streaming Agent | Starts immediately, total ~800ms | ~$0.002 | Interactive UIs needing partial results |
| Async Agent | ~800ms | ~$0.002 | Concurrent or async applications |
| GPT-4o-mini Agent | ~400ms | ~$0.0005 | Cost-sensitive or lower-precision needs |
Quick tip
Set your OpenAI model once via LlamaIndex's Settings (e.g. Settings.llm = OpenAI(model="gpt-4o")) to seamlessly integrate GPT models with your document index.

Common mistake
Forgetting to set OPENAI_API_KEY in the environment, or passing an LLM object that doesn't match LlamaIndex's expected interface, leads to authentication errors or confusing failures at query time.
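One way to guard against the missing-key mistake is a fail-fast check at startup. A minimal sketch: the helper name require_api_key is our own, not a LlamaIndex or OpenAI API.

```python
import os

def require_api_key(env=os.environ) -> str:
    """Raise at startup if the key is missing, instead of failing mid-query."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before building the agent")
    return key
```

Call it once before building the index so a missing key surfaces as a clear error rather than a confusing failure deep inside a query.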