model.generate(): the generation call
Why this matters
This is the main method you call to get text output from a causal language model in the transformers library: it's the bridge between "I have a prompt" and "I have generated text." Understanding its mechanics prevents common issues like runaway generation, memory bloat, and nonsensical output.
Explanation
model.generate() is the method that runs a trained language model in inference mode, producing new tokens sequentially based on the input you provide. It doesn't train: it predicts the next token, adds it to the sequence, feeds that back in, and repeats.
Mechanically: you pass in tokenized input (input_ids), and generate() runs a loop that (1) passes the sequence through the model to get logits for the next token, (2) samples or picks the highest-probability token, (3) appends it to your sequence, and (4) repeats until it hits a stopping condition: either max_new_tokens is reached, or an end-of-sequence token is generated.
Use generate() whenever you want a model to produce new text given a prompt. It handles the iteration loop for you and includes built-in safety guardrails like max_length to prevent runaway generation.
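The loop described above can be sketched without any library at all. This is a minimal illustration, not how transformers implements generate(): toy_next_token_logits, VOCAB_SIZE, and EOS_ID are hypothetical stand-ins for a real model and tokenizer, and the sketch always picks the highest-scoring token (greedy decoding).

```python
# Toy stand-in for a language model; NOT a transformers API.
VOCAB_SIZE = 5
EOS_ID = 4  # hypothetical end-of-sequence token ID


def toy_next_token_logits(sequence):
    """Return fake logits that favor (last token + 1) as the next token."""
    last = sequence[-1]
    logits = [0.0] * VOCAB_SIZE
    logits[(last + 1) % VOCAB_SIZE] = 5.0
    return logits


def greedy_generate(input_ids, max_new_tokens):
    sequence = list(input_ids)
    for _ in range(max_new_tokens):
        # (1) run the "model" on the sequence so far
        logits = toy_next_token_logits(sequence)
        # (2) pick the highest-scoring token (greedy, no sampling)
        next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)
        # (3) append it to the sequence
        sequence.append(next_id)
        # (4) stop early if the end-of-sequence token was produced
        if next_id == EOS_ID:
            break
    return sequence


stopped_by_eos = greedy_generate([0], max_new_tokens=10)
stopped_by_cap = greedy_generate([0], max_new_tokens=2)
print(stopped_by_eos)
print(stopped_by_cap)
```

The first call ends because the toy model emits EOS_ID; the second ends because the max_new_tokens cap is reached first. Those are the same two stopping conditions described above.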
Analogy
Think of it like predictive text on your phone. You type "Hello", and the phone suggests the next word. You accept it, and now it suggests the next word based on "Hello world". generate() automates that entire chain: it predicts forward, appends, predicts again, until it decides the message is complete.
Code
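import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pretrained GPT-2 tokenizer and model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize the prompt into a PyTorch tensor of token IDs
prompt = "The future of artificial intelligence"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

# Sample at most 20 new tokens after the prompt
generated_ids = model.generate(
    input_ids,
    max_new_tokens=20,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
)

# Decode the token IDs (prompt + continuation) back into text
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)

Example output (illustrative: sampling is on, so the text differs on every run, and a real run stops after at most 20 new tokens):
The future of artificial intelligence is a complex and multifaceted topic that has captured the imagination of scientists, technologists, and the general public alike. In recent years, advances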
What just happened?
We loaded a pretrained GPT-2 model and tokenizer, tokenized the prompt "The future of artificial intelligence" into input_ids, passed it to generate() with a cap of 20 new tokens, used temperature and top_p for controlled randomness, then decoded the resulting token IDs back into human-readable text. The model generated up to 20 additional tokens after the original prompt (fewer if it produced an end-of-sequence token first), and the returned sequence includes the prompt tokens followed by the new ones.
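Because a causal model like GPT-2 returns the prompt tokens followed by the continuation, it is often useful to separate the two. The sketch below uses plain Python lists with made-up token IDs (they are not real GPT-2 IDs) to show the slicing; with the tensors from the example, the equivalent slice would be generated_ids[0, input_ids.shape[1]:].

```python
# Hypothetical token IDs, for illustration only.
prompt_ids = [101, 102, 103]
# What a causal LM hands back: the prompt followed by the continuation.
output_ids = [101, 102, 103, 7, 8, 9, 10]

prompt_length = len(prompt_ids)
new_token_ids = output_ids[prompt_length:]  # drop the echoed prompt
num_new_tokens = len(new_token_ids)

print(new_token_ids, num_new_tokens)
```

Counting the new tokens this way is a quick check that max_new_tokens was respected.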
Common gotcha
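GPT-2 has no pad token, so if you omit pad_token_id, generate() logs a warning and falls back to eos_token_id. That is harmless for a single prompt, but batched prompts need consistent padding and an attention mask, or output quality can suffer. Also, if you set neither max_new_tokens nor max_length, generation falls back to a default length limit (historically max_length=20, counting the prompt), which often cuts the output short. Set these parameters explicitly: don't rely on defaults.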
Error recovery
RuntimeError: Expected all tensors to be on the same device
ValueError: token_ids in `generated_ids` larger than vocabulary size
OutOfMemoryError during generation
Experienced dev note
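The single biggest footgun: sampling parameters only take effect when do_sample=True. The library default is greedy decoding (do_sample=False), but a checkpoint's own generation config can override that, so check what you are actually running. In production, you usually want deterministic output unless you have a specific reason for randomness: set do_sample=False and leave temperature, top_k, and top_p unset. Don't pass temperature=0 to force determinism; transformers expects a strictly positive temperature. Test your temperature and sampling settings with realistic input: what works on a short prompt often fails on longer context.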
Check your understanding
If you call model.generate() twice with identical input_ids but different temperature values, why can the outputs differ when do_sample=True, and why are they identical when do_sample=False?
Answer hint
A correct answer explains that temperature rescales the probability distribution used to sample the next token: a lower temperature concentrates probability on the highest-scoring tokens, and a higher temperature flattens the distribution. With do_sample=True, each token is drawn at random from that distribution, so outputs can differ between calls and between temperature values. If do_sample=False (the default), generate() always picks the highest-probability token, so temperature is ignored and identical inputs produce identical outputs. To get deterministic output, set do_sample=False rather than temperature=0, since transformers expects a strictly positive temperature.
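To make the hint concrete, here is a small sketch of temperature scaling in plain Python. It is an illustration of the math, not the transformers implementation; softmax_with_temperature and the example logits are made up for this purpose.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter

# Greedy decoding takes the argmax, which temperature does not change.
greedy_choice_sharp = sharp.index(max(sharp))
greedy_choice_flat = flat.index(max(flat))

print(sharp, flat)
print(greedy_choice_sharp, greedy_choice_flat)
```

Temperature changes how much probability the top token gets, which matters when sampling, but it never changes which token ranks first. That is why greedy decoding gives the same output at any temperature.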