How to reduce output tokens from an LLM
Quick answer
To reduce output tokens from an LLM, set the max_tokens parameter to cap response length and write concise prompts that explicitly ask for short answers. Lowering temperature can also produce more focused, less verbose outputs.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
Use the max_tokens parameter to cap the output length and craft prompts that explicitly request brief answers.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
prompt = "Summarize the benefits of AI in one sentence."
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=30,
    temperature=0.3,
)
print("Output:", response.choices[0].message.content)
output
Output: AI improves efficiency and decision-making by automating tasks and analyzing data quickly.
Common variations
You can switch to a smaller, faster model such as gpt-4o-mini for shorter, cheaper responses, and a lower temperature can reduce rambling. Async calls and streaming output are useful for interactive applications.
import asyncio
import os
from openai import AsyncOpenAI  # the async client is required when using "await"

async def main():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = "Explain blockchain in two sentences."
    # With stream=True, the call returns an async iterator of chunks
    # instead of a single response object.
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50,
        temperature=0.2,
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()

asyncio.run(main())
output
Blockchain is a decentralized ledger that records transactions securely and transparently. It enables trustless peer-to-peer interactions without intermediaries.
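Another variation is the API's stop parameter, which ends generation as soon as the model emits a chosen sequence. A minimal sketch, where brief_request is a hypothetical helper and the blank-line stop sequence is an illustrative choice (it keeps answers to a single paragraph):

```python
# Sketch: build request kwargs that both cap output length (max_tokens)
# and cut generation early at the first blank line (stop).
def brief_request(prompt, max_tokens=40):
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stop": ["\n\n"],  # end generation at the first blank line
    }

kwargs = brief_request("Define an API in one sentence.")
# response = client.chat.completions.create(**kwargs)  # same client as above
```

Tokens after the stop sequence are never generated, so you are not billed for them.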
Troubleshooting
If your output is still too long, reduce max_tokens further or refine your prompt to explicitly request brevity (for example, "Answer in one sentence"). Keep in mind that max_tokens truncates the response mid-sentence rather than making the model write a shorter answer, so prompt wording matters. A high temperature can also make responses more rambling.
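One way to tell whether a reply was cut off by max_tokens is to check finish_reason on the response: the API reports "length" when the cap was hit and "stop" when the model finished on its own. A minimal sketch, where was_truncated is a hypothetical helper and a SimpleNamespace stands in for a real API response:

```python
from types import SimpleNamespace

# Hypothetical helper: True if the completion hit the max_tokens cap.
# The OpenAI API sets finish_reason to "length" in that case,
# and "stop" when the model ended the answer naturally.
def was_truncated(response):
    return response.choices[0].finish_reason == "length"

# Simulated response standing in for a real API result:
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake))  # True -> raise max_tokens or tighten the prompt
```

When truncation is detected, either raise max_tokens slightly or rewrite the prompt so the model aims for a shorter answer in the first place.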
Key Takeaways
- Use max_tokens to directly limit the number of output tokens from the LLM.
- Design prompts that explicitly ask for concise or brief answers to reduce verbosity.
- Lower temperature values can produce more focused, less verbose outputs.
- Choose smaller, faster models like gpt-4o-mini for shorter responses and cost savings.
- Streaming and async calls enable efficient handling of output tokens in interactive apps.