Greedy vs sampling generation
Why this matters
The generation strategy you choose determines whether your model produces repetitive, deterministic text (greedy) or creative, varied responses (sampling). This is fundamental to tuning model behavior for different use cases: chatbots need variety, code generation might need predictability.
Explanation
What it is: During text generation, a language model produces a probability distribution over all possible next tokens. Greedy decoding always selects the token with the highest probability; sampling draws a token at random from that distribution, so more probable tokens are chosen more often.
How it works: At each step, the model outputs logits (raw scores) for every token in the vocabulary. Greedy decoding takes argmax(logits), so the same input always yields the same choice. Sampling converts the logits to probabilities with softmax, treats them as a categorical distribution, and draws one token from it, so the same input can produce different outputs.
When to use: Use greedy for deterministic tasks (translation, code completion) where consistency matters. Use sampling for creative tasks (storytelling, dialogue) where variety improves user experience.
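The two selection rules can be illustrated on a single decoding step without loading a model. This is a minimal sketch in plain Python, not how `generate` is implemented: the four-token vocabulary and the logit values are made up for illustration, and the standard library's `random` module stands in for PyTorch's sampler.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up logits for one decoding step.
vocab = ["bright", "uncertain", "here", "banana"]
logits = [2.0, 1.5, 0.5, -1.0]
probs = softmax(logits)

# Greedy: always the token with the largest logit.
greedy_token = vocab[max(range(len(logits)), key=lambda i: logits[i])]

# Sampling: draw tokens at random, weighted by probability.
rng = random.Random(0)  # local generator so the draws are repeatable
sampled_tokens = [rng.choices(vocab, weights=probs, k=1)[0] for _ in range(5)]

print("greedy :", greedy_token)
print("sampled:", sampled_tokens)
```

Greedy returns the same token every time it sees these logits, while the sampled list can contain any token, with likelier tokens appearing more often.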
Analogy
Imagine giving someone a multiple-choice question. Greedy is like always picking the answer you're most confident in: predictable but sometimes boring. Sampling is like randomly picking from answers you think are reasonable: more variety, but occasionally you pick something weird.
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map='auto' needs the accelerate package installed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float32
)

prompt = "The future of AI is"
# Move the inputs to the same device as the model to avoid device-mismatch errors.
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

print("=== GREEDY DECODING ===")
# Greedy: pick the highest-probability token at every step.
greedy_output = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=False,
    temperature=1.0  # has no effect when do_sample=False
)
greedy_text = tokenizer.decode(greedy_output[0], skip_special_tokens=True)
print(f"Output: {greedy_text}")

print("\n=== SAMPLING (temperature=0.7) ===")
torch.manual_seed(42)  # fix the seed so this run is reproducible
sampling_output_1 = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
sampling_text_1 = tokenizer.decode(sampling_output_1[0], skip_special_tokens=True)
print(f"Output 1: {sampling_text_1}")

print("\n=== SAMPLING AGAIN (different seed) ===")
torch.manual_seed(99)  # a different seed gives different random draws
sampling_output_2 = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
sampling_text_2 = tokenizer.decode(sampling_output_2[0], skip_special_tokens=True)
print(f"Output 2: {sampling_text_2}")

# Compare the three generated strings.
print(f"\nGreedy and Sampling 1 are identical: {greedy_text == sampling_text_1}")
print(f"Sampling 1 and Sampling 2 are identical: {sampling_text_1 == sampling_text_2}")
Output
=== GREEDY DECODING ===
Output: The future of AI is the most important thing in the world. It is the

=== SAMPLING (temperature=0.7) ===
Output 1: The future of AI is in the hands of those who understand its power and

=== SAMPLING AGAIN (different seed) ===
Output 2: The future of AI is bright, but we must be careful about how we use

Greedy and Sampling 1 are identical: False
Sampling 1 and Sampling 2 are identical: False
What just happened?
The code loaded a GPT-2 model and generated continuations of the prompt "The future of AI is" three times. First, greedy decoding selected the highest-probability token at each step, producing deterministic output. Then sampling with `do_sample=True`, `temperature=0.7`, and `top_p=0.9` drew tokens at random from the probability distribution; because the two sampling runs used different seeds, they produced different continuations of the same prompt. The boolean checks confirm that the sampled outputs differ from the greedy output and from each other.
Common gotcha
Developers often forget that `do_sample=False` (greedy) ignores the `temperature` parameter: setting `temperature=0.5` with greedy decoding has no effect on the output. Conversely, a very high `temperature` (above about 2.0) with sampling can produce incoherent text, because the distribution flattens and low-probability tokens become nearly as likely as high-probability ones. Always verify that your generation parameters actually change behavior.
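The flattening effect of a high temperature can be seen on a toy distribution. The sketch below assumes the usual formulation, in which logits are divided by the temperature before softmax; the logit values are made up, and `softmax_with_temperature` is a hypothetical helper rather than a transformers function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5, -1.0]  # made-up scores for four tokens

for temperature in (0.5, 1.0, 2.0, 10.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the probabilities move toward one another, which is why unlikely tokens start to appear in the output.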
Error recovery
RuntimeError: Expected all tensors to be on the same device
TypeError: generate() got an unexpected keyword argument 'do_sample'
Warning: `temperature` is ignored when `do_sample=False`
Experienced dev note
In transformers 5.5.x, `do_sample` defaults to `False` unless the model's generation config overrides it, so many developers use greedy decoding without realizing it. If your model outputs feel repetitive or boring in production, the first thing to check is whether you explicitly set `do_sample=True`. Also, `temperature` and `top_p` are not substitutes for each other; they do different jobs and are commonly used together. Temperature rescales the logits, sharpening or flattening the whole distribution. `top_p` (nucleus sampling) then keeps only the smallest set of most likely tokens whose cumulative probability reaches `top_p` and discards the rest. The combination helps guard against both incoherent outputs (high temperature without `top_p`) and repetition (no sampling).
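How temperature and nucleus filtering interact can be sketched on a toy distribution. This is an illustration under stated assumptions, not the library's code: the vocabulary and logits are made up, `top_p_filter` is a hypothetical helper, and the order used here (temperature first, then `top_p`) is an assumption based on our understanding of how Hugging Face `generate` orders these two steps.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]

# Hypothetical vocabulary and made-up logits for one decoding step.
vocab = ["bright", "uncertain", "here", "banana"]
logits = [2.0, 1.5, 0.5, -1.0]

probs = softmax(logits, temperature=0.7)   # step 1: temperature
filtered = top_p_filter(probs, top_p=0.9)  # step 2: nucleus filtering

rng = random.Random(0)
token = rng.choices(vocab, weights=filtered, k=1)[0]  # step 3: sample

print([round(p, 3) for p in probs])
print([round(p, 3) for p in filtered])
print(token)
```

With these numbers the two least likely tokens fall outside the nucleus and receive zero probability, so the sample can only be one of the two most likely tokens.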
Check your understanding
Why does running the sampling code twice with the same seed produce identical outputs, but running it without setting a seed produces different outputs? What would happen if you set `do_sample=True` but `temperature=0.0`?
Answer hint
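A correct answer explains that the random seed controls PyTorch's random number generator, so resetting the seed reproduces the same random choices; without a fixed seed, the generator starts from a different state on each run. The second part requires understanding that temperature divides the logits before softmax, so as it approaches zero the distribution becomes extremely peaked and sampling behaves like greedy decoding. Exactly `temperature=0.0` would mean dividing by zero, and recent transformers versions reject it with an error asking for a strictly positive value. Use `do_sample=False` for greedy decoding, or a small positive temperature for near-greedy sampling.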
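The role of the seed can be checked with a small experiment. The sketch below uses the standard library's `random.Random` as a stand-in for PyTorch's generator (an assumption made so that it runs without a model; `torch.manual_seed` plays the same role in the code above), and the vocabulary and probabilities are made up.

```python
import random

# Hypothetical vocabulary with made-up next-token probabilities.
vocab = ["bright", "uncertain", "here", "banana"]
probs = [0.5, 0.3, 0.15, 0.05]

def sample_sequence(seed, length=8):
    """Draw `length` tokens using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return rng.choices(vocab, weights=probs, k=length)

run_a = sample_sequence(seed=42)
run_b = sample_sequence(seed=42)  # same seed as run_a
run_c = sample_sequence(seed=99)  # different seed

print("same seed, identical draws:", run_a == run_b)
print("different seed, identical draws:", run_a == run_c)
```

Two runs with the same seed replay the same sequence of random draws, which is the behavior the first question asks you to explain.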