How-to · Beginner · 3 min read

How to set max tokens in OpenAI API

Quick answer
Pass the max_tokens parameter to client.chat.completions.create() as an integer to cap how many tokens the model may generate in its reply. The limit applies only to the completion, not to your prompt.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the official OpenAI Python SDK and set your API key as an environment variable.

bash
pip install "openai>=1.0"
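The SDK reads your key from the OPENAI_API_KEY environment variable automatically. One way to set it for the current shell session (the key value below is a placeholder, not a real key):

```shell
# Make the key available to the current shell session
# (replace the placeholder with your real key from the OpenAI dashboard)
export OPENAI_API_KEY="sk-..."

# Print only the prefix to confirm it is set without leaking the key
echo "${OPENAI_API_KEY:0:3}..."
```

Add the export line to your shell profile (e.g. ~/.bashrc) if you want it to persist across sessions.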

Step by step

This example shows how to set max_tokens to limit the response length when creating a chat completion with the gpt-4o model.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain the theory of relativity."}],
    max_tokens=100  # Limit response to 100 tokens
)

print(response.choices[0].message.content)
output
The theory of relativity, developed by Albert Einstein, consists of two parts: special relativity and general relativity. Special relativity addresses the physics of objects moving at constant speeds, especially near the speed of light, introducing concepts like time dilation and length contraction. General relativity extends this to include gravity as the curvature of spacetime caused by mass and energy.

Common variations

  • Swap in a different model, such as gpt-4o-mini, by changing the model parameter.
  • For asynchronous calls, use the AsyncOpenAI client with asyncio.
  • Streaming responses (stream=True) honor max_tokens as well; the limit applies to the complete streamed completion.
python
import asyncio
import os
from openai import AsyncOpenAI

async def main():
    # AsyncOpenAI exposes the same interface as OpenAI, but its methods are awaitable
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize quantum computing."}],
        max_tokens=50
    )
    print(response.choices[0].message.content)

asyncio.run(main())
output
Quantum computing uses quantum bits or qubits, which can represent multiple states simultaneously, enabling complex computations much faster than classical computers for certain problems.
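Streaming works the same way: pass stream=True alongside max_tokens and print chunks as they arrive. A minimal sketch, assuming OPENAI_API_KEY is set in your environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# stream=True yields chunks as they are generated; max_tokens still
# caps the total length of the streamed completion.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize quantum computing."}],
    max_tokens=50,
    stream=True,
)

for chunk in stream:
    # Some chunks carry no content delta, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

This is useful when you want the first tokens on screen immediately rather than waiting for the capped response to finish.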

Troubleshooting

  • If you receive an error about token limits, ensure your max_tokens plus prompt tokens do not exceed the model's maximum context length.
  • Setting max_tokens too low may cut the response off mid-sentence; check whether response.choices[0].finish_reason == "length" to detect truncation, and raise the limit if output is incomplete.
  • Check your environment variable OPENAI_API_KEY is set correctly to avoid authentication errors.
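As a rough guard against the first issue above, you can sanity-check the token budget before calling the API. The fits_context helper below is illustrative (not part of the SDK), and the 128,000-token default reflects gpt-4o's context window; confirm the window for your model:

```python
def fits_context(prompt_tokens: int, max_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Return True if the prompt plus the completion budget fits the window."""
    return prompt_tokens + max_tokens <= context_window

# A 127,950-token prompt leaves no room for a 100-token completion:
print(fits_context(127_950, 100))  # False
print(fits_context(1_000, 100))    # True
```

For an accurate prompt token count, tokenize with a library such as tiktoken rather than estimating.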

Key Takeaways

  • Set max_tokens in client.chat.completions.create() to control response length.
  • Ensure total tokens (prompt + max_tokens) stay within model limits to avoid errors.
  • Use environment variables for API keys to keep credentials secure.
  • Async calls support max_tokens similarly to sync calls.
  • Adjust max_tokens based on desired response detail and length.
Verified 2026-04 · gpt-4o, gpt-4o-mini