How to reduce LLM serving costs
Quick answer
To reduce LLM serving costs, optimize prompts to minimize token usage, select smaller or specialized models like gpt-4o-mini, and implement batching and caching strategies. Additionally, use asynchronous calls and monitor usage to avoid unnecessary requests.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable to securely authenticate requests.
pip install openai>=1.0

Step by step
This example demonstrates how to reduce serving costs by using a smaller model, capping tokens per response, and reusing a single client across multiple prompts. Note that each call below is still a separate request; to combine many prompts into one discounted asynchronous job, see OpenAI's Batch API.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Reuse one client for several short prompts
prompts = [
    "Summarize the benefits of AI.",
    "Explain how caching reduces costs.",
    "List ways to optimize LLM usage.",
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,  # Limit tokens per response
    )
    print("Response:", response.choices[0].message.content)

output
Response: AI improves efficiency, automates tasks, and enables new innovations.
Response: Caching stores frequent responses to avoid repeated costly calls.
Response: Optimize prompts, select smaller models, batch requests, and monitor usage.
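The caching strategy mentioned in the quick answer can be as simple as an in-process dictionary keyed by model and prompt. A minimal sketch: here fake_completion is a stand-in for the real API call so the example runs without a key.

```python
# Minimal response cache keyed by (model, prompt).
cache = {}
calls = 0

def fake_completion(model, prompt):
    # Stand-in for client.chat.completions.create(...)
    global calls
    calls += 1
    return f"reply to: {prompt}"

def cached_completion(model, prompt):
    key = (model, prompt)
    if key not in cache:
        cache[key] = fake_completion(model, prompt)
    return cache[key]

cached_completion("gpt-4o-mini", "Summarize the benefits of AI.")
cached_completion("gpt-4o-mini", "Summarize the benefits of AI.")  # served from cache
print(calls)  # only one real call was made
```

For repeated or near-identical queries this avoids paying for the same tokens twice; in production you would typically add a size bound or TTL to the cache.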
Common variations
You can use asynchronous calls to improve throughput by running requests concurrently, or switch to a larger model like gpt-4o for higher quality at a higher per-token cost. Streaming responses can also reduce perceived latency by processing tokens as they arrive.
import asyncio
import os
from openai import AsyncOpenAI

# The async client exposes the same methods as OpenAI, but awaitable
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_chat():
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Explain batching in LLMs."}],
        max_tokens=50,
    )
    print(response.choices[0].message.content)

asyncio.run(async_chat())

output
Batching groups multiple prompts into one request to reduce overhead and cost.
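The single-call pattern above generalizes to many concurrent requests with asyncio.gather. This sketch stubs the API call with an ask() coroutine (a stand-in for an awaited chat completion) so the concurrency pattern itself is runnable:

```python
import asyncio

# ask() stands in for an awaited API call such as
# client.chat.completions.create(...)
async def ask(prompt):
    await asyncio.sleep(0.01)  # simulates network latency
    return f"answer: {prompt}"

async def main():
    prompts = ["a", "b", "c"]
    # All three "requests" wait on the network concurrently
    # instead of one after another.
    return await asyncio.gather(*(ask(p) for p in prompts))

results = asyncio.run(main())
print(results)
```

Because gather overlaps the waiting time, total latency is close to that of the slowest single request rather than the sum of all of them.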
Troubleshooting
If you see unexpectedly high token usage, check your prompt length and model choice. Use token counting tools to estimate costs before sending requests. If latency is high, try batching or asynchronous calls.
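For a quick pre-flight estimate, a common rule of thumb is roughly four characters per token for English text; exact counts require the model's tokenizer (for example, the tiktoken library). A minimal sketch of the heuristic:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    # Use a real tokenizer (e.g. tiktoken) for exact counts.
    return max(1, len(text) // 4)

prompt = "Summarize the benefits of AI."
print(estimate_tokens(prompt))  # prints 7
```

Multiplying the estimate by the model's per-token price gives a ballpark input cost before you send the request.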
Key Takeaways
- Use smaller or specialized models like gpt-4o-mini to lower per-token costs.
- Batch multiple prompts in a single API call to reduce overhead and improve efficiency.
- Limit max_tokens and optimize prompt length to minimize token consumption.
- Implement caching for repeated queries to avoid redundant API calls.
- Monitor usage and switch to asynchronous calls to improve throughput and reduce latency.