How to reduce output tokens from an LLM
Quick answer
To reduce output tokens from an LLM, set the max_tokens parameter to cap response length and write concise prompts that explicitly ask for short answers. Lowering temperature can also produce more focused, less verbose outputs.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
Use the max_tokens parameter to cap the output length and craft prompts that explicitly request brief answers.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
prompt = "Summarize the benefits of AI in one sentence."
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=30,
    temperature=0.3,
)
print("Output:", response.choices[0].message.content)
output
Output: AI improves efficiency and decision-making by automating tasks and analyzing data quickly.
Common variations
You can switch to a smaller, faster model such as gpt-4o-mini for shorter, cheaper responses, and a lower temperature can reduce rambling. Async calls and streaming output are useful for interactive applications.
import asyncio
import os
from openai import AsyncOpenAI  # the async client is required when using "await"

async def main():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = "Explain blockchain in two sentences."
    # With stream=True, the call returns an async iterator of chunks
    # instead of a single response object.
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50,
        temperature=0.2,
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()

asyncio.run(main())
output
Blockchain is a decentralized ledger that records transactions securely and transparently. It enables trustless peer-to-peer interactions without intermediaries.
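Another variation is the API's stop parameter, which ends generation as soon as the model emits a chosen sequence. A minimal sketch, where brief_request is a hypothetical helper and the blank-line stop sequence is an illustrative choice (it keeps answers to a single paragraph):

```python
# Sketch: build request kwargs that both cap output length (max_tokens)
# and cut generation early at the first blank line (stop).
def brief_request(prompt, max_tokens=40):
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stop": ["\n\n"],  # end generation at the first blank line
    }

kwargs = brief_request("Define an API in one sentence.")
# response = client.chat.completions.create(**kwargs)  # same client as above
```

Tokens after the stop sequence are never generated, so you are not billed for them.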
Troubleshooting
If your output is still too long, reduce max_tokens further or refine your prompt to explicitly request brevity (for example, "Answer in one sentence"). Keep in mind that max_tokens truncates the response mid-sentence rather than making the model write a shorter answer, so prompt wording matters. A high temperature can also make responses more rambling.
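One way to tell whether a reply was cut off by max_tokens is to check finish_reason on the response: the API reports "length" when the cap was hit and "stop" when the model finished on its own. A minimal sketch, where was_truncated is a hypothetical helper and a SimpleNamespace stands in for a real API response:

```python
from types import SimpleNamespace

# Hypothetical helper: True if the completion hit the max_tokens cap.
# The OpenAI API sets finish_reason to "length" in that case,
# and "stop" when the model ended the answer naturally.
def was_truncated(response):
    return response.choices[0].finish_reason == "length"

# Simulated response standing in for a real API result:
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake))  # True -> raise max_tokens or tighten the prompt
```

When truncation is detected, either raise max_tokens slightly or rewrite the prompt so the model aims for a shorter answer in the first place.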
Key Takeaways
- Use max_tokens to directly limit the number of output tokens from the LLM.
- Design prompts that explicitly ask for concise or brief answers to reduce verbosity.
- Lower temperature values can produce more focused, less verbose outputs.
- Choose smaller, faster models like gpt-4o-mini for shorter responses and cost savings.
- Streaming and async calls enable efficient handling of output tokens in interactive apps.