Code Beginner easy · 5 min

Greedy vs sampling generation

What you will learn

Greedy generation always picks the highest-probability next token; sampling randomly picks from the distribution to get diverse outputs.

Why this matters

The generation strategy you choose determines whether your model produces repetitive, deterministic text (greedy) or creative, varied responses (sampling). This is fundamental to tuning model behavior for different use cases: chatbots need variety, code generation might need predictability.

Skip if: Don't use sampling if you need reproducible, deterministic outputs for testing or compliance. Don't use greedy if your task requires creative or natural-sounding text (greedy often produces dull repetition).

Explanation

What it is: During text generation, a language model produces a probability distribution over all possible next tokens. Greedy decoding always selects the token with the highest probability; sampling draws a token at random from that distribution, weighted by probability.

How it works: At each step, the model outputs logits (raw scores) for every token in the vocabulary. Greedy decoding takes argmax(logits), so the same input always yields the same choice. Sampling converts the logits to probabilities with softmax and draws one token from the resulting categorical distribution, so the same input can produce different outputs.

When to use: Use greedy for deterministic tasks (translation, code completion) where consistency matters. Use sampling for creative tasks (storytelling, dialogue) where variety improves user experience.
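To make the two selection rules concrete, here is a minimal, library-free sketch of a single decoding step. The four-token vocabulary and the logit values are made up for illustration, and Python's `random` module stands in for the model library's sampler; this is not how transformers implements `generate()`.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: index of the largest logit (argmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, rng):
    """Sampling: draw one index, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical vocabulary and made-up logits for one decoding step.
vocab = ["bright", "uncertain", "here", "banana"]
logits = [2.0, 1.5, 1.0, -1.0]

rng = random.Random(0)
greedy_choices = [vocab[greedy_pick(logits)] for _ in range(5)]
sampled_choices = [vocab[sample_pick(logits, rng)] for _ in range(5)]

print("greedy :", greedy_choices)   # the same token every time
print("sampled:", sampled_choices)  # depends on the random draws
```

Greedy returns "bright" on every call because the logits never change. The sampled list can contain any token, with likelier tokens appearing more often over many draws.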

Analogy

Imagine giving someone a multiple-choice question. Greedy is like always picking the answer you're most confident in: predictable but sometimes boring. Sampling is like randomly picking from answers you think are reasonable: more variety, but occasionally you pick something weird.

Code

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float32
)

prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

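print("=== GREEDY DECODING ===")
# do_sample=False selects the argmax token at every step; temperature is
# ignored in this mode, so it is not passed here.
greedy_output = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=False
)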
greedy_text = tokenizer.decode(greedy_output[0], skip_special_tokens=True)
print(f"Output: {greedy_text}")

print("\n=== SAMPLING (temperature=0.7) ===")
torch.manual_seed(42)
sampling_output_1 = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
sampling_text_1 = tokenizer.decode(sampling_output_1[0], skip_special_tokens=True)
print(f"Output 1: {sampling_text_1}")

print("\n=== SAMPLING AGAIN (different seed) ===")
torch.manual_seed(99)
sampling_output_2 = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
sampling_text_2 = tokenizer.decode(sampling_output_2[0], skip_special_tokens=True)
print(f"Output 2: {sampling_text_2}")

print(f"\nGreedy and Sampling 1 are identical: {greedy_text == sampling_text_1}")
print(f"Sampling 1 and Sampling 2 are identical: {sampling_text_1 == sampling_text_2}")

Output

=== GREEDY DECODING ===
Output: The future of AI is the most important thing in the world. It is the

=== SAMPLING (temperature=0.7) ===
Output 1: The future of AI is in the hands of those who understand its power and

=== SAMPLING AGAIN (different seed) ===
Output 2: The future of AI is bright, but we must be careful about how we use

Greedy and Sampling 1 are identical: False
Sampling 1 and Sampling 2 are identical: False

What just happened?

The code loaded a GPT-2 model and generated continuations of the prompt "The future of AI is" three times. First, greedy decoding selected the highest-probability token at each step, producing deterministic output. Then, sampling with `do_sample=True` and `temperature=0.7` drew tokens at random from the probability distribution, once with seed 42 and once with seed 99, producing a different continuation each time despite the same input. The two boolean checks show that the sampled text differs from the greedy text and that the two sampled runs differ from each other; rerunning the greedy call would reproduce the same text every time.

Common gotcha

Developers often forget that `do_sample=False` (greedy) ignores the temperature parameter: setting `temperature=0.5` with greedy decoding has no effect. Similarly, a very high `temperature` (above roughly 2.0) with sampling can produce incoherent text because the distribution flattens and low-probability tokens become almost as likely as high-probability ones. Always verify that your generation parameters actually change behavior.
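The flattening effect of temperature can be checked on a toy distribution. This is a sketch with made-up logits; `softmax_with_temperature` is a hypothetical helper written for this example, not a transformers function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; temperature must be > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 1.0, -1.0]  # made-up scores, most likely token first

for t in (0.5, 1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    # Compare the most likely token with the least likely one.
    print(f"T={t}: top={probs[0]:.3f} tail={probs[-1]:.3f}")
```

As the temperature rises, the top probability shrinks and the tail probability grows, which is why very high temperatures let unlikely tokens through.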

Error recovery

RuntimeError: Expected all tensors to be on the same device

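The input IDs are on a different device from the model. Call `.to(model.device)` on `input_ids` before generating. Loading with `device_map='auto'` places the model on a device but does not move your inputs, so the `.to(model.device)` call is still needed.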

TypeError: generate() got an unexpected keyword argument 'do_sample'

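`do_sample` has been an accepted `generate()` argument in transformers for many releases, so this error usually means `generate()` is being called on an object that is not a transformers model (for example, a custom wrapper with its own `generate` method) or on a very outdated install. Check `type(model)`, and upgrade if needed: `pip install --upgrade transformers`. Passing `do_sample=True` or `do_sample=False` explicitly is good practice in any version.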

Warning: `temperature` is ignored when `do_sample=False`

This is expected behavior, not an error. Remove the `temperature` parameter when using greedy decoding, or set `do_sample=True` if you want temperature to have an effect.

Experienced dev note

In transformers 5.5.x, the default is `do_sample=False` (unless the checkpoint's generation config overrides it), so many developers use greedy decoding without realizing it. If your model outputs feel repetitive or boring in production, first check whether you explicitly set `do_sample=True`. Also, temperature and `top_p` are not substitutes; they control different things and are commonly used together. In the transformers sampling pipeline, temperature rescales the logits first, and `top_p` (nucleus sampling) then keeps only the smallest set of tokens whose cumulative probability reaches `top_p`. The combination guards against incoherent output (high temperature without `top_p`) while still avoiding the repetition of greedy decoding.
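The interaction between temperature and `top_p` can be sketched on a toy distribution. The example below applies temperature first and the nucleus cut second; that ordering is an assumption about the library's pipeline, so check it against your installed version. `top_p_filter` is a hypothetical helper and the logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalise the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

logits = [2.0, 1.5, 1.0, -1.0]             # made-up scores
probs = softmax(logits, temperature=0.7)   # step 1: temperature
nucleus = top_p_filter(probs, top_p=0.9)   # step 2: nucleus cut

print("kept token indices:", sorted(nucleus))
print("renormalised probabilities:", nucleus)
```

With these numbers the least likely token falls outside the nucleus and can never be sampled, while the three remaining tokens keep their relative odds.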

Check your understanding

Why does running the sampling code twice with the same seed produce identical outputs, but running it without setting a seed produces different outputs? What would happen if you set `do_sample=True` but `temperature=0.0`?

Answer hint

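A correct answer explains that the random seed fixes the state of PyTorch's random number generator, so resetting the seed reproduces the same sequence of random draws; without a seed, the generator state differs between runs. For the second part: logits are divided by the temperature before softmax, so as the temperature approaches zero the distribution becomes extremely peaked and sampling converges to greedy behavior. At exactly `temperature=0.0` the division is undefined, and transformers rejects the value with an error instead of silently falling back to greedy. To get greedy output, set `do_sample=False`; for near-greedy sampling, use a small positive temperature.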
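Both parts of the answer can be checked without loading a model. In this sketch, Python's `random.Random(seed)` stands in for `torch.manual_seed`, and the weights and logits are made up; `sample_tokens` is a hypothetical helper.

```python
import math
import random

def sample_tokens(weights, steps, seed=None):
    """Draw `steps` token indices; a fixed seed fixes the sequence."""
    rng = random.Random(seed)  # seed=None seeds from OS entropy
    return rng.choices(range(len(weights)), weights=weights, k=steps)

def softmax(logits, temperature):
    """Temperature-scaled softmax; temperature must be positive."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Part 1: the same seed reproduces the same draws.
weights = [0.5, 0.3, 0.15, 0.05]
same_seed_a = sample_tokens(weights, steps=10, seed=42)
same_seed_b = sample_tokens(weights, steps=10, seed=42)
print("same seed, same draws:", same_seed_a == same_seed_b)

# Part 2: a tiny positive temperature makes the top token dominate.
logits = [2.0, 1.5, 1.0, -1.0]
nearly_greedy = softmax(logits, temperature=0.01)
print("top-token probability at T=0.01:", nearly_greedy[0])
```

The first check prints `True`, and the top-token probability at a temperature of 0.01 is indistinguishable from 1, which is the sense in which very low temperatures behave like greedy decoding.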

VERSION In transformers 4.x, whether a model sampled by default could depend on the checkpoint: the library default was `do_sample=False`, but a model's `generation_config.json` could set `do_sample=True`. Code that relied on implicit sampling may behave differently after upgrading to 5.5.x, so pass `do_sample=True` explicitly instead of depending on defaults.

Next, learn how `temperature` and `top_p` fine-tune the randomness of sampling to control output quality and coherence.
