max_new_tokens vs max_length
Why this matters
Getting these confused causes silent failures where your generation stops unexpectedly or produces incomplete responses, wasting API calls and breaking production pipelines.
Explanation
max_new_tokens and max_length sound similar but control completely different things. max_new_tokens is the number of tokens the model is allowed to generate going forward from the current position. max_length is the absolute maximum total length of the entire sequence (prompt + generated tokens combined).
When you pass a 100-token prompt with max_length=150, the model can generate at most 50 new tokens. With max_new_tokens=100, the model can generate up to 100 new tokens regardless of prompt length, for a total of up to 200 tokens. Either limit is a ceiling: generation ends earlier if the model emits an end-of-sequence token. In transformers 5.5.x, passing both parameters causes max_new_tokens to take precedence, with a warning that both were set.
Use max_new_tokens for generation tasks: it's explicit, predictable, and what the framework expects. max_length exists for backward compatibility but introduces confusion in real code.
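The arithmetic above can be sketched in a few lines of plain Python. This is a simplified model of the behavior described here, not library code: new_token_budget is a hypothetical helper, and it assumes that max_new_tokens takes precedence when both limits are given and that a prompt longer than max_length simply leaves a budget of zero.

```python
def new_token_budget(prompt_len, max_new_tokens=None, max_length=None):
    """Upper bound on generated tokens under the rules described in the text.

    Hypothetical helper for illustration; it does not call transformers.
    """
    if max_new_tokens is not None:
        # max_new_tokens is a direct budget and ignores the prompt length.
        return max_new_tokens
    if max_length is not None:
        # max_length caps prompt + generated tokens, so the prompt eats into it.
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_new_tokens or max_length")


prompt_len = 100
via_max_length = new_token_budget(prompt_len, max_length=150)
via_max_new = new_token_budget(prompt_len, max_new_tokens=100)

print(f"max_length=150     -> up to {via_max_length} new tokens")
print(f"max_new_tokens=100 -> up to {via_max_new} new tokens, "
      f"{prompt_len + via_max_new} total")
```

Both numbers are ceilings; a real generation can stop earlier at an end-of-sequence token.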
Analogy
Think of max_length as the total capacity of a bucket (say, a 50-gallon tank) and max_new_tokens as how much you are allowed to pour in right now (say, 5 gallons). If the bucket already holds water (the prompt), max_length limits how much more fits. max_new_tokens doesn't care what's already there.
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu")

prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
print(f"Prompt token count: {input_ids.shape[1]}")
print()

# Run 1: cap the number of newly generated tokens.
print("=== Using max_new_tokens=20 ===")
output_max_new = model.generate(
    input_ids,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
    do_sample=False,  # greedy decoding, so the run is repeatable
)
generated_text_new = tokenizer.decode(output_max_new[0], skip_special_tokens=True)
print(f"Output length: {output_max_new.shape[1]} tokens")
print(f"Generated text: {generated_text_new}")
print()

# Run 2: cap the total sequence length, prompt included.
print("=== Using max_length=30 (prompt=5 tokens, so up to 25 new tokens) ===")
output_max_len = model.generate(
    input_ids,
    max_length=30,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=False,
)
generated_text_len = tokenizer.decode(output_max_len[0], skip_special_tokens=True)
print(f"Output length: {output_max_len.shape[1]} tokens")
print(f"Generated text: {generated_text_len}")
Output
Prompt token count: 5

=== Using max_new_tokens=20 ===
Output length: 25 tokens
Generated text: The future of AI is very bright and exciting. I think we will see a lot of new technologies and innovations. We will also see a lot of new opportunities for people

=== Using max_length=30 (prompt=5 tokens, so up to 25 new tokens) ===
Output length: 30 tokens
Generated text: The future of AI is very bright and exciting. I think we will see a lot of new technologies and innovations. We will also see
What just happened?
The code tokenized a 5-token prompt ("The future of AI is"), then generated text twice: once with max_new_tokens=20 (which produced 25 total tokens: 5 original + 20 new), and once with max_length=30 (which capped the total at 30 tokens, leaving room for 30 - 5 = 25 new tokens). The two runs end at different lengths because the parameters measure different things: one counts the tokens to generate, the other sets a ceiling on the whole sequence, prompt included.
Common gotcha
Developers often set max_length thinking it controls the generation budget, then wonder why long prompts produce fewer tokens than expected. Because max_length counts the prompt, the room left for new tokens shrinks as the prompt grows, and nothing flags it when that happens. Prefer max_new_tokens for generation and avoid max_length in new code.
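To see the gotcha in numbers, the sketch below compares the two limits across several prompt lengths. It is plain arithmetic under the rules described in this lesson, with illustrative values (MAX_LENGTH=120, MAX_NEW_TOKENS=100) and a hypothetical budgets helper; it does not call the library.

```python
MAX_LENGTH = 120       # illustrative cap on prompt + generated tokens
MAX_NEW_TOKENS = 100   # illustrative cap on generated tokens only


def budgets(prompt_len):
    """Return (budget under max_length, budget under max_new_tokens)."""
    under_max_length = max(MAX_LENGTH - prompt_len, 0)
    under_max_new_tokens = MAX_NEW_TOKENS  # unaffected by the prompt
    return under_max_length, under_max_new_tokens


rows = [(n, *budgets(n)) for n in (10, 50, 100, 120)]

print("prompt_len  max_length budget  max_new_tokens budget")
for prompt_len, under_max_length, under_max_new_tokens in rows:
    print(f"{prompt_len:>10}  {under_max_length:>17}  {under_max_new_tokens:>21}")
```

The max_length column falls from 110 to 0 as the prompt grows, while the max_new_tokens column stays at 100.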
Error recovery
UserWarning about max_length
Output shorter than expected
Experienced dev note
In older transformers releases, max_length was the standard way to cap generation, and max_new_tokens was added later. When code passes both, max_new_tokens takes precedence and max_length is ignored apart from the warning. If you inherit code that sets both, remove max_length: it is dead weight that only creates confusion. If the code sets only max_length, it is still the active limit, so replace it with an explicit max_new_tokens rather than just deleting it.
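One way to clean up inherited call sites is a small shim that rewrites the keyword arguments before they reach generate. The sketch below is an assumption-laden illustration: migrate_generate_kwargs is a hypothetical helper, not part of transformers, and it assumes you know the prompt length in tokens at the call site.

```python
def migrate_generate_kwargs(kwargs, prompt_len):
    """Return a copy of kwargs that uses max_new_tokens only.

    Hypothetical helper for illustration; not part of transformers.
    """
    migrated = dict(kwargs)  # leave the caller's dict untouched
    legacy_max_length = migrated.pop("max_length", None)

    if "max_new_tokens" in migrated or legacy_max_length is None:
        # Either max_new_tokens already wins, or there is nothing to convert.
        return migrated

    budget = legacy_max_length - prompt_len
    if budget <= 0:
        raise ValueError(
            f"prompt ({prompt_len} tokens) already fills max_length="
            f"{legacy_max_length}; choose an explicit max_new_tokens"
        )
    migrated["max_new_tokens"] = budget
    return migrated


legacy_call = {"max_length": 120, "do_sample": False}
print(migrate_generate_kwargs(legacy_call, prompt_len=50))
```

The converted value reproduces the old budget only for that prompt length. For prompts of varying size, choosing a fixed max_new_tokens by hand is usually the clearer fix.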
Check your understanding
You have a 50-token prompt. You set max_new_tokens=100. Your colleague sets max_length=120 on the same prompt. Which will produce a longer output, and by how many tokens? Why?
Answer hint
max_new_tokens is a direct budget (up to 100 tokens generated, 150 total), while max_length is a total ceiling that includes the prompt (120 - 50 = only 70 new tokens, 120 total). max_new_tokens produces the longer output by 30 tokens, assuming neither run stops early at an end-of-sequence token. The insight is that the budget under max_length depends on the prompt length, while the budget under max_new_tokens does not.