Code intermediate · 3 min read

How to build an agent with LlamaIndex

Direct answer
Use LlamaIndex to load and index your documents, then build an agent by pairing the index with a language model such as OpenAI's GPT-4o through a query engine, so it can answer questions about your data interactively.

Setup

Install
bash
pip install llama-index
Env vars
OPENAI_API_KEY
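For example, in a POSIX shell (the key shown is a placeholder, not a real value):

```shell
# Export the API key so the OpenAI client can pick it up (placeholder value)
export OPENAI_API_KEY="sk-your-key-here"
```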
Imports
python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Examples

In: What is the main topic of the documents?
Out: The documents primarily discuss renewable energy technologies and their impact on climate change.
In: Summarize the key points from the indexed files.
Out: The key points include solar and wind energy benefits, challenges in adoption, and recent policy developments.
In: Who authored the documents and when?
Out: The documents were authored by the Environmental Research Group in 2025.

Integration steps

  1. Install LlamaIndex and set OPENAI_API_KEY in your environment variables.
  2. Load your documents with LlamaIndex's SimpleDirectoryReader or another loader.
  3. Configure the LLM by assigning an OpenAI GPT-4o instance to Settings.llm.
  4. Build a VectorStoreIndex from the loaded documents.
  5. Create a query engine from the index and ask natural-language questions.
  6. Print or process the responses returned by the agent.

Full code

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# Configure the LLM; the client reads OPENAI_API_KEY from the environment
Settings.llm = OpenAI(model="gpt-4o")

# Load documents from a directory
documents = SimpleDirectoryReader("data").load_data()

# Build the vector index
index = VectorStoreIndex.from_documents(documents)

# Create a query engine and ask a natural-language question
query_engine = index.as_query_engine()
response = query_engine.query("What are the main topics covered in the documents?")

print("Agent response:", response)

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "What are the main topics covered in the documents?"}]}
Response
json
{"choices": [{"message": {"content": "The documents cover topics related to renewable energy, including solar and wind power technologies, their benefits, challenges, and policy implications."}}]}
Extract: response.choices[0].message.content
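A minimal sketch of that extraction against a raw SDK-style response dict (the payload below is a shortened, hypothetical stand-in for the trace above):

```python
# Hypothetical response dict shaped like the API trace
data = {
    "choices": [
        {"message": {"content": "The documents cover renewable energy topics."}}
    ]
}

# Drill into the first choice's message content
content = data["choices"][0]["message"]["content"]
print(content)  # → The documents cover renewable energy topics.
```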

Variants

Streaming Agent Response

Use streaming when you want to display partial responses in real-time for better user experience.

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o")

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Enable token streaming on the query engine
query_engine = index.as_query_engine(streaming=True)

streaming_response = query_engine.query("Summarize the documents.")
print("Agent response: ", end="")
streaming_response.print_response_stream()  # prints tokens as they arrive
Async Agent Query

Use async when integrating the agent into an async web server or concurrent environment.

python
import asyncio

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o")

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

async def async_query(query: str) -> None:
    # aquery runs the same retrieval/LLM pipeline without blocking the event loop
    response = await query_engine.aquery(query)
    print("Agent async response:", response)

asyncio.run(async_query("What is the summary of the documents?"))
Alternative Model: Using GPT-4o-mini

Use GPT-4o-mini for faster, lower-cost queries when high precision is less critical.

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# gpt-4o-mini trades some answer quality for lower latency and cost
Settings.llm = OpenAI(model="gpt-4o-mini")

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("Explain the main ideas in the documents.")
print("Agent response:", response)

Performance

Latency: ~800 ms for GPT-4o non-streaming queries
Cost: ~$0.002 per 500 tokens with GPT-4o
Rate limits: Tier 1: 500 requests per minute / 30,000 tokens per minute
  • Use concise prompts to reduce token usage.
  • Cache index results to avoid repeated calls.
  • Use smaller models like gpt-4o-mini for less critical queries.
| Approach | Latency | Cost/call | Best for |
| --- | --- | --- | --- |
| Standard GPT-4o agent | ~800 ms | ~$0.002 | High-quality, general-purpose queries |
| Streaming agent | Starts immediately, total ~800 ms | ~$0.002 | Interactive UIs needing partial results |
| Async agent | ~800 ms | ~$0.002 | Concurrent or async applications |
| GPT-4o-mini agent | ~400 ms | ~$0.0005 | Cost-sensitive or lower-precision needs |

Quick tip

Assign your OpenAI GPT model to LlamaIndex's Settings.llm once, and every index and query engine you build afterwards will use it, with no need to pass the model through each call.

Common mistake

Forgetting to set OPENAI_API_KEY, or importing from the legacy llama_index namespace instead of llama_index.core, produces authentication errors or ImportErrors that are easy to misread as indexing failures.

Verified 2026-04 · gpt-4o, gpt-4o-mini
Verify ↗