Code Beginner easy · 4 min

model.generate(): the generation call

What you will learn

model.generate() takes tokenized input and produces new tokens one at a time until it hits a stopping condition.

Why this matters

This is the main method you call to get text output from a causal language model in transformers: it's the bridge between "I have a prompt" and "I have generated text." Understanding its mechanics helps you avoid common issues like runaway generation, memory bloat, and nonsensical output.

Skip if: you need hidden representations or classification outputs. Call the model directly (e.g., model(input_ids).logits) or use a task-specific pipeline instead. generate() is specifically for autoregressive text production.

Explanation

model.generate() is the method that runs a trained language model in inference mode, producing new tokens sequentially based on the input you provide. It doesn't train: it predicts the next token, adds it to the sequence, feeds that back in, and repeats.

Mechanically: you pass in tokenized input (input_ids), and generate() runs a loop that (1) passes the sequence through the model to get logits for the next token, (2) samples or picks the highest-probability token, (3) appends it to your sequence, and (4) repeats until it hits a stopping condition: either max_new_tokens is reached, or an end-of-sequence token is generated.
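
The four steps above can be sketched in plain Python. This is a minimal illustration, not the transformers implementation: toy_next_token_logits is a hypothetical stand-in for the model's forward pass, EOS_ID is an assumed end-of-sequence id, and the pick in step (2) is always greedy.

```python
from typing import Callable, List

EOS_ID = 0  # assumed end-of-sequence token id for this toy example


def toy_next_token_logits(sequence: List[int]) -> List[float]:
    """Hypothetical stand-in for a model forward pass.

    Returns one score per vocabulary id (vocabulary size 6). Toy rule:
    favour (last token + 1), and favour EOS once the last token reaches 4.
    """
    logits = [0.0] * 6
    last = sequence[-1]
    if last >= 4:
        logits[EOS_ID] = 5.0
    else:
        logits[last + 1] = 5.0
    return logits


def greedy_generate(
    input_ids: List[int],
    next_token_logits: Callable[[List[int]], List[float]],
    max_new_tokens: int,
    eos_id: int = EOS_ID,
) -> List[int]:
    sequence = list(input_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(sequence)  # (1) score every candidate next token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # (2) pick the top one
        sequence.append(next_id)  # (3) append it to the sequence
        if next_id == eos_id:  # (4) stop early on end-of-sequence
            break
    return sequence


full = greedy_generate([1], toy_next_token_logits, max_new_tokens=10)
capped = greedy_generate([1], toy_next_token_logits, max_new_tokens=2)
print(full)    # stops when EOS is produced
print(capped)  # stops when the token budget runs out
```

The first call ends because the toy model emits EOS; the second ends because max_new_tokens is exhausted. Those are the same two stopping conditions generate() applies.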

Use generate() whenever you want a model to produce new text given a prompt. It handles the iteration loop for you and includes built-in length limits such as max_new_tokens (or the older max_length) to prevent runaway generation.

Analogy

Think of it like predictive text on your phone. You type "Hello", and the phone suggests the next word. You accept it, and now it suggests the next word based on "Hello world". generate() automates that entire chain: it predicts forward, appends, predicts again, until it decides the message is complete.

Code

python

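import torch  # PyTorch must be installed for return_tensors="pt"
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model from the same checkpoint
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize the prompt into PyTorch tensors
prompt = "The future of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"]

generated_ids = model.generate(
    input_ids,
    attention_mask=inputs["attention_mask"],  # avoids the missing-mask warning
    max_new_tokens=20,                        # cap on newly generated tokens
    temperature=0.7,                          # only used because do_sample=True
    top_p=0.9,                                # nucleus sampling threshold
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id       # GPT-2 has no pad token of its own
)

# generated_ids[0] holds the prompt tokens followed by the new tokens
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)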

Output

The future of artificial intelligence is a complex and multifaceted topic that has captured the imagination of scientists, technologists, and the general public alike. In recent years, advances

What just happened?

We loaded a pretrained GPT-2 model and tokenizer, tokenized the prompt "The future of artificial intelligence" into input_ids, passed it to generate() with a maximum of 20 new tokens, used temperature and top_p for controlled randomness, then decoded the resulting token IDs back into human-readable text. The model generates at most 20 new tokens after the prompt (fewer if it emits the end-of-sequence token first). Because do_sample=True, the text changes from run to run, so treat the output shown above as illustrative rather than exact.
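
To make "controlled randomness" concrete, here is a simplified sketch of what temperature and top_p do to a next-token distribution. It is an illustration under stated assumptions, not the transformers code path: the logits are made up, and softmax_with_temperature and top_p_filter are hypothetical helper names.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set()
    cumulative = 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
flat = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.7)
nucleus = top_p_filter(sharp, top_p=0.9)

print([round(p, 3) for p in flat])
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in nucleus])
```

With these numbers, temperature=0.7 shifts probability toward the top token, and top_p=0.9 then removes the two least likely tokens entirely before sampling.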

Common gotcha

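For models without a pad token (GPT-2 is one), leaving pad_token_id unset makes generate() log a warning and fall back to eos_token_id. Passing input_ids without an attention_mask logs a similar warning and can degrade output when inputs are padded. Also, if you set neither max_new_tokens nor max_length, generate() falls back to the model's generation config, whose default length limit is often short (historically max_length=20, counting the prompt), so output can be cut off unexpectedly. Set these parameters explicitly rather than relying on defaults.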

Error recovery

RuntimeError: Expected all tensors to be on the same device

Your input_ids are on CPU but the model is on GPU (or vice versa). Fix: move input_ids to the same device as the model: input_ids = input_ids.to(model.device) before passing to generate().

ValueError: token_ids in `generated_ids` larger than vocabulary size

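generate() can only emit IDs that exist in the model's own output vocabulary, so out-of-range IDs usually point to a tokenizer/model mismatch rather than a missing eos_token_id: the tokenizer came from a different checkpoint, or tokens were added to one side without resizing the other. Fix: load the tokenizer and model from the same checkpoint, and call model.resize_token_embeddings(len(tokenizer)) after adding tokens. Passing pad_token_id=tokenizer.eos_token_id is still good practice, but it does not fix this error.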

OutOfMemoryError during generation

max_new_tokens is too large, or you're using beam search (num_beams > 1), which keeps a separate key/value cache for every beam. Fix: reduce max_new_tokens (start with 50), or set num_beams=1 (greedy decoding or do_sample=True) to cut cache memory.
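
A back-of-the-envelope estimate shows why sequence length and beam count drive memory. The sketch below assumes standard multi-head attention with one key tensor and one value tensor cached per layer; kv_cache_bytes is a hypothetical helper, and the result ignores model weights, activations, and allocator overhead.

```python
def kv_cache_bytes(
    num_layers: int,
    hidden_size: int,
    seq_len: int,
    batch_size: int = 1,
    num_beams: int = 1,
    bytes_per_value: int = 4,  # 4 for float32, 2 for float16/bfloat16
) -> int:
    """Rough key/value cache size: 2 tensors (K and V) per layer, per token, per beam."""
    return 2 * num_layers * hidden_size * seq_len * batch_size * num_beams * bytes_per_value


# GPT-2 small: 12 layers, hidden size 768, context window 1024 tokens.
greedy_bytes = kv_cache_bytes(num_layers=12, hidden_size=768, seq_len=1024)
beam_bytes = kv_cache_bytes(num_layers=12, hidden_size=768, seq_len=1024, num_beams=4)

print(f"num_beams=1: {greedy_bytes / 2**20:.0f} MiB")
print(f"num_beams=4: {beam_bytes / 2**20:.0f} MiB")
```

The cache grows linearly with total sequence length and with the number of beams, which is why lowering either one is the first thing to try.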

Experienced dev note

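The single biggest footgun: generate() takes its defaults from the model's generation_config, which varies by checkpoint, so the same call can be greedy for one model and sampled for another. In production you usually want deterministic output: set do_sample=False and leave temperature, top_k, and top_p unset, unless you have a specific reason for randomness. (transformers rejects temperature=0 when sampling; do_sample=False is how you get greedy decoding.) If you do sample, test your temperature and sampling settings with realistic input: what works on a short prompt often fails on longer context.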
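
One way to keep these choices explicit is to build the keyword arguments in one place. This is a sketch, assuming greedy decoding via do_sample=False is what you want for deterministic output; generation_kwargs is a hypothetical helper, and only the dictionary keys (do_sample, max_new_tokens, temperature, top_p) are real generate() parameters.

```python
def generation_kwargs(
    deterministic: bool,
    max_new_tokens: int = 50,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> dict:
    """Return keyword arguments for generate(); sampling knobs appear only when sampling."""
    if deterministic:
        # Greedy decoding: temperature and top_p would be ignored, so leave them out.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive when sampling")
    return {
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
    }


print(generation_kwargs(deterministic=True))
print(generation_kwargs(deterministic=False))
```

Usage would look like model.generate(input_ids, **generation_kwargs(deterministic=True)), which keeps the production and experimentation settings from drifting apart.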

Check your understanding

If you call model.generate() twice with identical input_ids but different temperature values, why are the outputs identical when do_sample=False, and why can they differ when do_sample=True?

Answer hint

A correct answer explains that temperature rescales the probability distribution used to pick the next token: lower temperature makes high-probability tokens more likely to be picked. Temperature only matters when do_sample=True; in that case tokens are drawn at random, so outputs can differ between calls and between temperature values unless you fix the random seed. With do_sample=False (greedy decoding, the library default unless the model's generation config says otherwise), generate() always picks the highest-probability token, temperature is ignored, and identical inputs produce identical outputs. Note that transformers rejects temperature=0 when sampling; use do_sample=False to get deterministic output.
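
The reasoning can be checked with a small experiment on made-up logits. This is a sketch, not library code: pick_greedy and pick_sampled are hypothetical helpers that mimic do_sample=False and do_sample=True for a single decoding step.

```python
import math
import random
from typing import List


def temperature_probs(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by a strictly positive temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits: List[float], temperature: float) -> int:
    """Mimics do_sample=False: always take the most likely token."""
    probs = temperature_probs(logits, temperature)
    return max(range(len(probs)), key=probs.__getitem__)


def pick_sampled(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Mimics do_sample=True: draw a token at random according to its probability."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up scores for a 3-token vocabulary

# Dividing by a positive temperature never changes which logit is largest.
greedy_picks = {pick_greedy(logits, t) for t in (0.1, 0.7, 2.0)}

# With sampling, temperature changes how often the top token wins.
rng = random.Random(0)
top_share_cold = sum(pick_sampled(logits, 0.1, rng) == 0 for _ in range(1000)) / 1000
top_share_hot = sum(pick_sampled(logits, 2.0, rng) == 0 for _ in range(1000)) / 1000

print(greedy_picks)
print(top_share_cold, top_share_hot)
```

The greedy pick is the same token at every temperature, while the sampled picks favour the top token far more often at temperature 0.1 than at 2.0.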

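VERSION This lesson targets transformers 5.x. device_map='auto' is an argument to from_pretrained(), not to generate(), and it requires the accelerate package: AutoModelForCausalLM.from_pretrained(model_name, device_map='auto') is the usual way to spread a model across multiple GPUs. Older 4.x releases accept the same argument, though device handling has changed between major versions. The generate() signature itself is largely stable; check the release notes for your installed version if defaults behave differently.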

Next, learn how to control the quality and diversity of generated text by tuning temperature, top_p, and sampling strategies: the parameters that make generate() produce either coherent prose or creative randomness.
