How to reduce LLM API costs
Quick answer
To reduce LLM API costs, use smaller or specialized models like gpt-4o-mini for less complex tasks and implement caching to avoid redundant calls. Additionally, optimize prompt length and batch requests to minimize token usage and API calls.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable for secure access.
pip install openai>=1.0
# Set environment variable in your shell
# export OPENAI_API_KEY="your-api-key"
output
Requirement already satisfied: openai in /usr/local/lib/python3.10/site-packages (x.y.z)
Step by step
This example demonstrates how to reduce costs by using a smaller model, limiting token usage, and caching responses locally.
import os
import hashlib
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# In-memory cache keyed by a hash of the prompt
cache = {}

def get_cache_key(prompt):
    return hashlib.sha256(prompt.encode()).hexdigest()

def call_llm(prompt):
    key = get_cache_key(prompt)
    if key in cache:
        print("Using cached response")
        return cache[key]
    # A smaller model plus a token cap keeps the per-call cost low
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    text = response.choices[0].message.content
    cache[key] = text
    return text

if __name__ == "__main__":
    prompt = "Explain the benefits of caching API responses."
    answer = call_llm(prompt)
    print("LLM response:", answer)
output
LLM response: Caching API responses reduces redundant calls, saving tokens and costs while improving response times.
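The dictionary cache above disappears when the process exits. A common variation is to persist responses to disk so repeated runs also avoid paying for the same prompt twice. Below is a minimal sketch: the file name `llm_cache.json` and the helper names are illustrative choices, not part of the OpenAI SDK.

```python
import hashlib
import json
import os

CACHE_PATH = "llm_cache.json"  # hypothetical cache file

def load_cache(path=CACHE_PATH):
    # Return the persisted cache, or an empty dict on first run
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    with open(path, "w") as f:
        json.dump(cache, f)

def cached_call(prompt, fetch, cache):
    # fetch is any function that actually calls the LLM API;
    # it only runs on a cache miss
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in cache:
        cache[key] = fetch(prompt)
    return cache[key]
```

Wrap your API call in `cached_call`, then call `save_cache` before the program exits; the next run starts from `load_cache()` instead of an empty dictionary.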
Common variations
You can further reduce costs by:
- Using asynchronous calls to batch multiple prompts.
- Choosing specialized or smaller models like gpt-4o-mini or gpt-4o depending on task complexity.
- Truncating or summarizing prompts to reduce token count.
import asyncio
import os
from openai import AsyncOpenAI

# The async client is required to await chat completions;
# the sync OpenAI client's methods are not awaitable
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def call_llm_async(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "What is RAG in AI?",
        "Summarize the benefits of caching.",
        "How to optimize token usage?"
    ]
    # asyncio.gather sends all requests concurrently
    results = await asyncio.gather(*(call_llm_async(p) for p in prompts))
    for i, res in enumerate(results):
        print(f"Response {i+1}: {res}")

if __name__ == "__main__":
    asyncio.run(main())
output
Response 1: RAG stands for Retrieval-Augmented Generation, combining retrieval with generation.
Response 2: Caching reduces redundant API calls, saving tokens and improving speed.
Response 3: Optimize token usage by shortening prompts and batching requests.
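For the truncation variation, a rough sketch is below. Note that tokens are not characters: the 4-characters-per-token ratio used here is only a rule of thumb, and exact counts require a tokenizer such as tiktoken.

```python
def truncate_prompt(prompt, max_chars=400):
    # Keep the start of the prompt. At roughly 4 characters per token,
    # 400 characters is on the order of 100 tokens.
    if len(prompt) <= max_chars:
        return prompt
    # Cut at a word boundary and mark the truncation
    return prompt[:max_chars].rsplit(" ", 1)[0] + " ..."
```

For long documents, summarizing with a cheap model first and sending only the summary to the expensive model is often a better trade-off than blind truncation.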
Troubleshooting
If you notice unexpectedly high costs, check for:
- Excessively long prompts or responses increasing token usage.
- Repeated API calls without caching.
- Using large models unnecessarily.
Use logging to monitor token usage per request and adjust accordingly.
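Each chat completion response exposes a `usage` object with `prompt_tokens` and `completion_tokens`. The helper below logs those counts and estimates cost per request; the per-million-token rates are placeholder assumptions for illustration, not current OpenAI pricing.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Illustrative per-1M-token rates; check your provider's current pricing
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def log_usage(model, prompt_tokens, completion_tokens):
    # Estimate cost in USD from token counts and log it
    rates = PRICES[model]
    cost = (prompt_tokens * rates["input"]
            + completion_tokens * rates["output"]) / 1_000_000
    logging.info("model=%s prompt=%d completion=%d est_cost=$%.6f",
                 model, prompt_tokens, completion_tokens, cost)
    return cost
```

Call it after each request, e.g. `log_usage("gpt-4o-mini", response.usage.prompt_tokens, response.usage.completion_tokens)`, and watch the log for requests whose cost is out of line with the rest.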
Key takeaways
- Use smaller or specialized models like gpt-4o-mini for cost-sensitive tasks.
- Implement caching to avoid repeated API calls for the same prompt.
- Limit prompt and response length to reduce token consumption.
- Batch multiple prompts asynchronously to optimize API usage.
- Monitor token usage regularly to identify cost spikes.