Concept Beginner · 3 min read

What is max tokens in LLMs

Quick answer
Max tokens is the upper limit on the number of tokens a large language model (LLM) can handle in a single request, covering both the input prompt and the generated output. It constrains how much text you can send and receive in one call, affecting prompt length and response size. In most APIs, a related max_tokens parameter caps the output portion specifically.

How it works

Max tokens sets the total token budget for a single interaction with an LLM, combining both the prompt tokens and the generated completion tokens. Think of it like a container with a fixed volume: you can fill it with some input text, but the leftover space determines how much output the model can produce. Tokens are chunks of text—words or parts of words—so the max tokens limit controls the total text length the model handles at once.

For example, if a model has a max tokens limit of 4,096 and your prompt uses 1,000 tokens, the model can generate at most 3,096 tokens in response. If the prompt takes up most of the budget, the output will be cut short; if the prompt alone exceeds the limit, the request fails outright.
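The arithmetic above can be sketched in a few lines of Python. This is an illustrative helper, not part of any SDK; the 4,096-token window is the example value from the text, and real models vary:

```python
# Illustrative token budgeting: output space is whatever the prompt leaves over.
CONTEXT_WINDOW = 4096  # example limit from the text; real models differ

def max_output_tokens(prompt_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Return how many completion tokens remain after the prompt."""
    remaining = context_window - prompt_tokens
    return max(remaining, 0)  # a prompt over the limit leaves no room at all

print(max_output_tokens(1000))  # 3096 tokens left for the response
print(max_output_tokens(4200))  # 0 -- the prompt alone exceeds the window
```

In practice you would measure prompt_tokens with the model's tokenizer rather than guessing, since tokens rarely map one-to-one to words.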

Concrete example

Here’s how to cap output length with the max_tokens parameter in an OpenAI API call using the gpt-4o model:

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain max tokens in simple terms."}],
    max_tokens=100  # limit output to 100 tokens
)

print(response.choices[0].message.content)
output
Max tokens define the maximum length of text the model can generate or process in one request, including your input and the output combined.

When to use it

Use max tokens to control response length and cost. Set it lower when you want concise answers or want to leave headroom in the budget. Set it higher for detailed explanations or long-form content generation. Keep the prompt plus max_tokens within the model’s limit, or the request will fail or the output will be truncated.

For example, use a smaller max tokens for chatbots needing quick replies, and a larger one for document summarization or code generation tasks.
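One way to apply this is a small helper that picks a cap per use case and clamps it to whatever the prompt leaves free. The preset names and values here are hypothetical, chosen only to mirror the examples above:

```python
# Hypothetical per-use-case output caps (values are illustrative only).
PRESETS = {"chatbot": 150, "summary": 500, "long_form": 2000}

def pick_max_tokens(use_case: str, prompt_tokens: int, context_window: int = 4096) -> int:
    """Choose a max_tokens value: the use-case preset, clamped to the
    space the prompt leaves inside the context window."""
    requested = PRESETS.get(use_case, 500)  # fall back to a mid-sized cap
    available = max(context_window - prompt_tokens, 0)
    return min(requested, available)

print(pick_max_tokens("chatbot", 1000))    # 150: preset fits easily
print(pick_max_tokens("long_form", 3000))  # 1096: clamped by the remaining budget
```

The clamp matters because asking for more output tokens than the window allows is exactly the failure mode the previous paragraph warns about.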

Key takeaways

  • Max tokens limits the total input plus output tokens in one LLM request.
  • Plan prompt length and max tokens together to avoid truncation or errors.
  • Adjust max tokens based on desired response length and use case.
  • Different models have different max token limits; check model docs before use.
Verified 2026-04 · gpt-4o