How to reduce LLM API costs
Quick answer
To reduce LLM API costs, use smaller or specialized models like gpt-4o-mini for less complex tasks and implement caching to avoid redundant calls. Additionally, optimize prompt length and batch requests to minimize token usage and API calls.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable for secure access.
pip install openai>=1.0
# Set environment variable in your shell
# export OPENAI_API_KEY="your-api-key"
output
Requirement already satisfied: openai in /usr/local/lib/python3.10/site-packages (x.y.z)
Step by step
This example demonstrates how to reduce costs by using a smaller model, limiting token usage, and caching responses locally.
import os
import hashlib
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# In-memory cache keyed by a hash of the prompt
cache = {}

def get_cache_key(prompt):
    return hashlib.sha256(prompt.encode()).hexdigest()

def call_llm(prompt):
    key = get_cache_key(prompt)
    if key in cache:
        print("Using cached response")
        return cache[key]
    # A smaller model plus a token cap keeps the per-call cost low
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    text = response.choices[0].message.content
    cache[key] = text
    return text

if __name__ == "__main__":
    prompt = "Explain the benefits of caching API responses."
    answer = call_llm(prompt)
    print("LLM response:", answer)
output
LLM response: Caching API responses reduces redundant calls, saving tokens and costs while improving response times.
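The dictionary cache above disappears when the process exits. A common variation is to persist responses to disk so repeated runs also avoid paying for the same prompt twice. Below is a minimal sketch: the file name `llm_cache.json` and the helper names are illustrative choices, not part of the OpenAI SDK.

```python
import hashlib
import json
import os

CACHE_PATH = "llm_cache.json"  # hypothetical cache file

def load_cache(path=CACHE_PATH):
    # Return the persisted cache, or an empty dict on first run
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    with open(path, "w") as f:
        json.dump(cache, f)

def cached_call(prompt, fetch, cache):
    # fetch is any function that actually calls the LLM API;
    # it only runs on a cache miss
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in cache:
        cache[key] = fetch(prompt)
    return cache[key]
```

Wrap your API call in `cached_call`, then call `save_cache` before the program exits; the next run starts from `load_cache()` instead of an empty dictionary.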
Common variations
You can further reduce costs by:
- Using asynchronous calls to batch multiple prompts.
- Choosing specialized or smaller models like gpt-4o-mini or gpt-4o depending on task complexity.
- Truncating or summarizing prompts to reduce token count.
import asyncio
import os
from openai import AsyncOpenAI

# The async client is required to await chat completions;
# the sync OpenAI client's methods are not awaitable
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def call_llm_async(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "What is RAG in AI?",
        "Summarize the benefits of caching.",
        "How to optimize token usage?"
    ]
    # asyncio.gather sends all requests concurrently
    results = await asyncio.gather(*(call_llm_async(p) for p in prompts))
    for i, res in enumerate(results):
        print(f"Response {i+1}: {res}")

if __name__ == "__main__":
    asyncio.run(main())
output
Response 1: RAG stands for Retrieval-Augmented Generation, combining retrieval with generation.
Response 2: Caching reduces redundant API calls, saving tokens and improving speed.
Response 3: Optimize token usage by shortening prompts and batching requests.
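For the truncation variation, a rough sketch is below. Note that tokens are not characters: the 4-characters-per-token ratio used here is only a rule of thumb, and exact counts require a tokenizer such as tiktoken.

```python
def truncate_prompt(prompt, max_chars=400):
    # Keep the start of the prompt. At roughly 4 characters per token,
    # 400 characters is on the order of 100 tokens.
    if len(prompt) <= max_chars:
        return prompt
    # Cut at a word boundary and mark the truncation
    return prompt[:max_chars].rsplit(" ", 1)[0] + " ..."
```

For long documents, summarizing with a cheap model first and sending only the summary to the expensive model is often a better trade-off than blind truncation.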
Troubleshooting
If you notice unexpectedly high costs, check for:
- Excessively long prompts or responses increasing token usage.
- Repeated API calls without caching.
- Using large models unnecessarily.
Use logging to monitor token usage per request and adjust accordingly.
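Each chat completion response exposes a `usage` object with `prompt_tokens` and `completion_tokens`. The helper below logs those counts and estimates cost per request; the per-million-token rates are placeholder assumptions for illustration, not current OpenAI pricing.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Illustrative per-1M-token rates; check your provider's current pricing
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def log_usage(model, prompt_tokens, completion_tokens):
    # Estimate cost in USD from token counts and log it
    rates = PRICES[model]
    cost = (prompt_tokens * rates["input"]
            + completion_tokens * rates["output"]) / 1_000_000
    logging.info("model=%s prompt=%d completion=%d est_cost=$%.6f",
                 model, prompt_tokens, completion_tokens, cost)
    return cost
```

Call it after each request, e.g. `log_usage("gpt-4o-mini", response.usage.prompt_tokens, response.usage.completion_tokens)`, and watch the log for requests whose cost is out of line with the rest.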
Key takeaways
- Use smaller or specialized models like gpt-4o-mini for cost-sensitive tasks.
- Implement caching to avoid repeated API calls for the same prompt.
- Limit prompt and response length to reduce token consumption.
- Batch multiple prompts asynchronously to optimize API usage.
- Monitor token usage regularly to identify cost spikes.