How-to · Beginner · 3 min read

How to reduce token usage in prompts

Quick answer
To reduce token usage in prompts, use concise and clear language, avoid unnecessary context or repetition, and leverage prompt templates or variables. Additionally, truncate or summarize long inputs before sending them to the model to minimize tokens.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the spec so the shell doesn't treat >= as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable to securely authenticate your requests.

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

This example demonstrates how to reduce token usage by using a concise prompt and avoiding redundant context. It also shows how to truncate long inputs before sending them.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example long input that we truncate
long_input = (
    "This is a very long text that contains a lot of unnecessary details "
    "and background information that is not relevant to the question. "
    "We want to summarize or truncate it to save tokens before sending to the model."
)

# Truncate to the first 100 characters to reduce tokens. Characters are
# only a rough proxy (the model counts tokens, not characters), but
# shorter text always means fewer tokens.
truncated_input = long_input[:100]

prompt = f"Summarize this briefly: {truncated_input}"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)

print("Response:", response.choices[0].message.content)
output
Response: This text contains unnecessary details; in brief, it focuses on summarizing or truncating input to save tokens before sending to the model.

Common variations

You can also reduce tokens by using prompt templates with placeholders, so static context is written once and reused rather than repeated. Smaller models like gpt-4o-mini do not change the token count, but they lower the per-token price. Likewise, async calls and streaming responses do not save tokens; they improve throughput and perceived latency in interactive apps.

python
import os
import asyncio
from openai import AsyncOpenAI

# Use the async client; the sync OpenAI client's methods cannot be awaited
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_prompt():
    # The template keeps the static instruction in one place
    prompt_template = "Answer concisely: {question}"
    question = "What is RAG in AI?"
    prompt = prompt_template.format(question=question)

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    print("Async response:", response.choices[0].message.content)

asyncio.run(async_prompt())
output
Async response: RAG stands for Retrieval-Augmented Generation, combining retrieval of documents with generation for better answers.

Troubleshooting

  • If your token usage is unexpectedly high, check for repeated or verbose context in prompts.
  • Use token counting tools or SDK utilities to measure prompt length before sending.
  • Ensure you are not sending unnecessary system messages or metadata.
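The token-counting tip above can be sketched as a small helper. It uses the `tiktoken` library when installed and otherwise falls back to a rough ~4-characters-per-token heuristic, which is a common rule of thumb for English text, not an exact count.

```python
def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Estimate how many tokens `text` will use before sending it.

    Prefers tiktoken's encoder for the given model; if tiktoken is not
    installed (or doesn't know the model), falls back to a rough
    ~4-characters-per-token heuristic.
    """
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        return max(1, len(text) // 4)

print(count_tokens("Summarize this briefly: some long input..."))
```

Checking the count before each call makes it easy to spot prompts that have quietly grown through repeated or verbose context.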

Key Takeaways

  • Use concise, clear prompts and avoid redundant context to reduce tokens.
  • Truncate or summarize long inputs before sending to the model.
  • Leverage prompt templates with variables to minimize repeated text.
  • Choose smaller models like gpt-4o-mini for cost-sensitive use cases.
  • Measure token usage with SDK tools to optimize prompt design.
Verified 2026-04 · gpt-4o-mini