How to reduce LlamaIndex costs
Quick answer
To reduce LlamaIndex costs, minimize the amount of data you index by filtering and chunking documents efficiently, cache responses so you don't pay for repeated queries, and choose smaller or cheaper models for embedding and querying where quality allows.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install llama-index openai
Setup
Install the llama-index and openai Python packages, and set your OpenAI API key in the OPENAI_API_KEY environment variable.
pip install llama-index openai
Step by step
This example shows how to reduce costs by chunking documents, limiting indexed data, and caching query results.
# Requires llama-index >= 0.10, which uses Settings-based configuration
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Use small, cheap models for querying and embedding to save cost;
# the API key is read from the OPENAI_API_KEY environment variable
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Smaller chunks with modest overlap limit token usage per request
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)

# Load and chunk documents efficiently
documents = SimpleDirectoryReader("data").load_data()

# Build the index and a query engine over it
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Cache query results to avoid repeated calls
cache = {}

def query_index(query_text):
    if query_text not in cache:
        cache[query_text] = query_engine.query(query_text)
    return cache[query_text]

# Example query
result = query_index("What is the main topic?")
# Example query
result = query_index("What is the main topic?")
print(result.response)
Output
The main topic is ... (depends on your documents)
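The in-memory cache above is lost when the process exits, so every new run pays for the same queries again. A minimal disk-backed variant, assuming responses can be stored as plain strings (the cache.json file name is arbitrary), might look like:

```python
import json
import os

CACHE_PATH = "cache.json"  # arbitrary location for the persisted cache

def load_cache(path=CACHE_PATH):
    """Load previously cached query/response pairs, if any."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    """Persist the cache so later runs skip paid API calls."""
    with open(path, "w") as f:
        json.dump(cache, f)

def cached_query(query_text, run_query, cache):
    """Return a cached answer, or compute it via run_query and store it."""
    if query_text not in cache:
        cache[query_text] = run_query(query_text)
        save_cache(cache)
    return cache[query_text]
```

Here run_query would wrap your query call and return the response text; only cache misses cost money, and the cache survives restarts.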
Common variations
- Use async calls with asyncio to batch queries.
- Switch to a cheaper embedding model such as text-embedding-3-small if your provider supports it.
- Limit document size, or preprocess documents to remove irrelevant content before indexing.
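The preprocessing variation can be sketched without any LlamaIndex-specific code. The filter below drops very short lines (navigation fragments, page numbers) and collapses leftover whitespace before text reaches the indexer; the 20-character threshold is an arbitrary assumption you should tune for your documents:

```python
import re

def clean_text(text, min_line_chars=20):
    """Drop very short lines and collapse whitespace before indexing.

    min_line_chars is an arbitrary threshold; tune it per corpus.
    """
    kept = [
        line.strip()
        for line in text.splitlines()
        if len(line.strip()) >= min_line_chars
    ]
    # Collapse runs of spaces and tabs left over from extraction artifacts
    return re.sub(r"[ \t]+", " ", "\n".join(kept))
```

Applying clean_text to each document's text before ingestion reduces the number of chunks, and therefore the number of embedding calls you pay for.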
Troubleshooting
- If you see high token usage, reduce the chunk size or the number of chunks retrieved per query.
- If queries are slow, implement caching or switch to a smaller model.
- If you get authentication errors, check that the OPENAI_API_KEY environment variable is set correctly.
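When diagnosing high token usage, a rough estimate of what indexing will cost helps before you pay for it. The sketch below assumes roughly four characters per token for English text, which is an approximation, not a real tokenizer:

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate (~4 chars/token for English); not exact."""
    return max(1, len(text) // chars_per_token)

def estimate_corpus(documents, chunk_size=512):
    """Estimate total tokens and chunk count for a list of text strings."""
    total = sum(estimate_tokens(doc) for doc in documents)
    chunks = -(-total // chunk_size)  # ceiling division
    return total, chunks
```

For exact counts, OpenAI's tiktoken library tokenizes with the model's real vocabulary; the heuristic above is just a quick sanity check before reducing chunk size.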
Key Takeaways
- Filter and chunk documents to reduce the amount of data LlamaIndex indexes.
- Use smaller or cheaper models for embeddings and queries to lower API costs.
- Implement caching to avoid repeated expensive queries to the index.