Code Beginner easy · 4 min

max_new_tokens vs max_length

What you will learn

max_new_tokens limits how many tokens the model generates; max_length limits the total sequence length including the prompt.

Why this matters

Getting these confused causes silent failures where your generation stops unexpectedly or produces incomplete responses, wasting API calls and breaking production pipelines.

Skip if: You already set max_new_tokens on every generate() call and never touch max_length. For generation tasks, max_new_tokens is the clearer, more predictable choice; treat max_length as a legacy parameter.

Explanation

max_new_tokens and max_length sound similar but control completely different things. max_new_tokens is the number of tokens the model is allowed to generate going forward from the current position. max_length is the absolute maximum total length of the entire sequence (prompt + generated tokens combined).

When you pass a 100-token prompt with max_length=150, the model can generate at most 50 new tokens. With max_new_tokens=100, the model can generate up to 100 new tokens regardless of prompt length, for a total of up to 200 tokens. Either way, generation can stop sooner if the model emits an end-of-sequence token. If you set both parameters, recent transformers releases give max_new_tokens precedence and warn about the conflict; the exact message, and whether it is a warning or an error, has varied between versions.

Use max_new_tokens for generation tasks: it's explicit, predictable, and what the framework expects. max_length exists for backward compatibility but introduces confusion in real code.
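
The arithmetic above can be sketched in a few lines of plain Python. The helper name new_token_budget is hypothetical (it is not a transformers function), and it simplifies the library's behavior: it only models the length limits and ignores early stops at an end-of-sequence token.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Upper bound on generated tokens under the rules described above.

    Illustrative helper only; not part of transformers.
    """
    if max_new_tokens is not None:
        # max_new_tokens counts only generated tokens, so the prompt is irrelevant.
        return max_new_tokens
    if max_length is not None:
        # max_length caps prompt + generated tokens, so the prompt eats into it.
        return max(0, max_length - prompt_len)
    raise ValueError("set max_length or max_new_tokens")


# 100-token prompt, max_length=150: room for 50 new tokens
print(new_token_budget(100, max_length=150))
# 100-token prompt, max_new_tokens=100: 100 new tokens, up to 200 in total
print(new_token_budget(100, max_new_tokens=100))
```

Checking max_new_tokens first mirrors the precedence described above when both values are supplied.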

Analogy

Think of max_length as a total bucket capacity (50-gallon tank), but max_new_tokens is how much you're pouring in this moment (5 gallons). If your bucket already has water (the prompt), max_length means you can't fill much more. max_new_tokens doesn't care what's already there.

Code

python

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loads on CPU by default, so no device_map (and no accelerate install) is needed.
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

print(f"Prompt token count: {input_ids.shape[1]}")
print()

print("=== Using max_new_tokens=20 ===")
output_max_new = model.generate(
    input_ids,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=False
)
generated_text_new = tokenizer.decode(output_max_new[0], skip_special_tokens=True)
print(f"Output length: {output_max_new.shape[1]} tokens")
print(f"Generated text: {generated_text_new}")
print()

print("=== Using max_length=30 (prompt=5 tokens, so up to 25 new tokens) ===")
output_max_len = model.generate(
    input_ids,
    max_length=30,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=False
)
generated_text_len = tokenizer.decode(output_max_len[0], skip_special_tokens=True)
print(f"Output length: {output_max_len.shape[1]} tokens")
print(f"Generated text: {generated_text_len}")

Output

Prompt token count: 5

=== Using max_new_tokens=20 ===
Output length: 25 tokens
Generated text: The future of AI is <20 generated tokens>

=== Using max_length=30 (prompt=5 tokens, so up to 25 new tokens) ===
Output length: 30 tokens
Generated text: The future of AI is <25 generated tokens; the first 20 match the run above>

What just happened?

The code tokenized a 5-token prompt ("The future of AI is"), then generated text twice with greedy decoding: once with max_new_tokens=20 (25 total tokens: 5 prompt + 20 new), and once with max_length=30 (total capped at 30 tokens, leaving room for 30 - 5 = 25 new tokens). The two runs end at different lengths because the parameters measure different things: one counts the tokens to generate, the other sets a ceiling on the whole sequence.

Common gotcha

Developers often set max_length thinking it controls the generation budget, then wonder why long prompts produce fewer tokens than expected. The prompt counts against max_length, so every extra prompt token silently removes one token from the generation budget. Prefer max_new_tokens and avoid max_length in new code.

Error recovery

UserWarning about max_length

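If you see a warning that generation is falling back to a default max_length (in many 4.x releases the message begins 'Using the model-agnostic default `max_length`'; the exact wording varies by version), you didn't specify either parameter. Always set max_new_tokens explicitly.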

Output shorter than expected

If your generated text cuts off early, the most likely cause is max_length combined with a prompt longer than you expected. Switch to max_new_tokens=your_desired_length and remove max_length entirely.
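
To tell a length cutoff apart from the model simply finishing, one option is to compare the output length against the cap. The sketch below uses a hypothetical helper, hit_length_cap, and assumes a decoder-only model whose output sequence includes the prompt (as with gpt2 in the lesson code).

```python
def hit_length_cap(prompt_len, output_len, max_length=None, max_new_tokens=None):
    """Return True if output_len reached the configured cap.

    Hypothetical diagnostic helper, not part of transformers. Assumes a
    decoder-only model, where the output sequence includes the prompt.
    """
    if max_new_tokens is not None:
        cap = prompt_len + max_new_tokens  # generated-token budget on top of the prompt
    elif max_length is not None:
        cap = max_length  # ceiling on prompt + generated tokens
    else:
        raise ValueError("set max_length or max_new_tokens")
    return output_len >= cap


# A 40-token prompt with max_length=50 leaves room for only 10 new tokens.
prompt_len, output_len = 40, 50
print(hit_length_cap(prompt_len, output_len, max_length=50))  # True: the cap cut it off
print(output_len - prompt_len)  # 10 new tokens were generated
# With max_new_tokens=100, the same 50-token output means the model stopped on its own.
print(hit_length_cap(prompt_len, output_len, max_new_tokens=100))  # False
```

With the lesson code, prompt_len would be input_ids.shape[1] and output_len would be the generated tensor's shape[1].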

Experienced dev note

Early transformers releases offered only max_length; max_new_tokens was added later in the 4.x series, well before 4.30. When both are passed, recent releases give max_new_tokens precedence and emit a warning rather than ignoring max_length silently (some older releases raised an error instead). If you inherit code that sets max_length alongside max_new_tokens, remove the max_length: it's dead weight creating confusion. If the code sets only max_length, that limit is still in effect, so replace it with an explicit max_new_tokens budget.
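
A possible way to retire an inherited max_length is sketched below. The function migrate_generate_kwargs is hypothetical, not a transformers utility; it assumes generation settings are held in a plain dict and that a representative prompt length is known.

```python
def migrate_generate_kwargs(kwargs, prompt_len):
    """Return a copy of kwargs that uses max_new_tokens instead of max_length.

    Hypothetical migration helper; not part of transformers.
    """
    migrated = dict(kwargs)
    legacy_max_length = migrated.pop("max_length", None)
    if legacy_max_length is None or "max_new_tokens" in migrated:
        # Nothing to convert, or max_new_tokens already states the budget.
        return migrated
    if legacy_max_length <= prompt_len:
        raise ValueError("max_length leaves no room for new tokens")
    # Preserve the old budget for this prompt length: total cap minus prompt.
    migrated["max_new_tokens"] = legacy_max_length - prompt_len
    return migrated


legacy = {"max_length": 120, "do_sample": False}
print(migrate_generate_kwargs(legacy, prompt_len=50))
# {'do_sample': False, 'max_new_tokens': 70}
```

The converted value only reproduces the old behavior for that one prompt length. In practice it is usually better to choose a fixed max_new_tokens that reflects how much output you actually want.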

Check your understanding

You have a 50-token prompt. You set max_new_tokens=100. Your colleague sets max_length=120 on the same prompt. Which will produce a longer output, and by how many tokens? Why?

Show answer hint

The correct answer requires understanding that max_new_tokens is absolute (100 tokens generated) while max_length is a total ceiling minus prompt (120 - 50 = only 70 new tokens). max_new_tokens wins by 30 tokens. The insight is that max_length is relative to prompt length, max_new_tokens is not.

VERSION Written against transformers 5.5.x, which steers generation code toward max_new_tokens. max_new_tokens has been available since the 4.x series (it predates 4.30), so code written for older versions using only max_length will still work but may trigger warnings on current releases.

Next, learn how do_sample and temperature control whether generation is deterministic or creative: essential for understanding why the same prompt produces different outputs.
