do_sample=True: enabling sampling
Why this matters
By default, generate() in Hugging Face transformers uses greedy decoding, which is deterministic and tends to produce repetitive text. For creative or conversational tasks, set do_sample=True to get varied, more natural-sounding outputs. It is one of the parameters with the most visible effect on how generated text reads.
Explanation
What it is: do_sample=True changes how a language model picks the next token during text generation. Instead of always choosing the token with the highest probability (greedy decoding), it randomly samples from the probability distribution.
How it works: When the model generates text, it outputs a probability distribution over all possible next tokens. With do_sample=False (default), it picks the highest-probability token every time, like always calling the favored side of a weighted coin. With do_sample=True, it draws from the full distribution, like actually flipping that coin. With no filtering applied, a token with 70% probability gets picked about 70% of the time, and one with 20% about 20% of the time. This randomness is what makes outputs diverse and natural-sounding.
When to use it: Use sampling for any task where you want creative variation: dialogue, storytelling, content generation, open-ended question answering. Keep it off for deterministic tasks where you need the same output every time.
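To make the greedy-versus-sampling distinction concrete, here is a minimal sketch that uses only the Python standard library. The three-token distribution and the helper names pick_greedy and pick_sampled are made up for illustration; a real model produces a distribution over its whole vocabulary at every step.

```python
import random
from collections import Counter

# Toy next-token distribution (made-up numbers, not from a real model).
next_token_probs = {"bright": 0.70, "uncertain": 0.20, "here": 0.10}

def pick_greedy(probs):
    """Greedy decoding: always return the single most likely token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Sampling: draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so this demo is repeatable
greedy_counts = Counter(pick_greedy(next_token_probs) for _ in range(10_000))
sampled_counts = Counter(pick_sampled(next_token_probs, rng) for _ in range(10_000))

print("greedy: ", dict(greedy_counts))   # every pick is "bright"
print("sampled:", dict(sampled_counts))  # counts roughly follow 70/20/10
```

Over many draws the sampled counts approach the underlying probabilities, while the greedy counts never leave the top token.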
Analogy
Imagine a chef making the same dish repeatedly. With greedy decoding, they always use their favorite ingredient at each step (the highest-probability choice), producing identical dishes every time. With sampling, they still favor their favorite ingredients but occasionally reach for alternatives, producing subtle variations that feel fresh while staying true to the original recipe.
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the small GPT-2 checkpoint and its tokenizer.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

print("=== Greedy Decoding (do_sample=False) ===")
# Greedy decoding ignores top_k and top_p, so they are not passed here.
greedy_output = model.generate(
    input_ids,
    max_length=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

print("\n=== Sampling (do_sample=True) ===")
for i in range(3):
    # top_k=0 and top_p=1.0 disable filtering: sample from the full distribution.
    sampled_output = model.generate(
        input_ids,
        max_length=20,
        do_sample=True,
        top_k=0,
        top_p=1.0,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"Sample {i+1}: {tokenizer.decode(sampled_output[0], skip_special_tokens=True)}")

Example output (the sampled lines differ from run to run):

=== Greedy Decoding (do_sample=False) ===
The future of AI is bright. We are living in a time of great technological change

=== Sampling (do_sample=True) ===
Sample 1: The future of AI is uncertain, but it will likely have a significant impact on
Sample 2: The future of AI is a topic that has been discussed extensively in the media and
Sample 3: The future of AI is full of possibilities, both positive and negative, and it will
What just happened?
The code generated text from the same prompt in two modes. First, with do_sample=False, it produced a single continuation; because greedy decoding always picks the highest-probability token, rerunning it gives the same text every time. Second, with do_sample=True, each of the three generations produced a different continuation because the model sampled from the probability distribution. The sampled outputs shown here are all grammatical, though with no filtering that is not guaranteed on every run.
Common gotcha
Developers often assume do_sample=True makes output purely random, but sampling still respects the learned probabilities. The real gotcha: do_sample=True with top_k=0 and top_p=1.0 (no filtering) can still pick very low-probability tokens that sound weird. Use top_p=0.95 or top_k=50 alongside sampling to keep output grounded while maintaining diversity.
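The filtering described above can be sketched in plain Python. This is a simplified illustration of the idea behind top_k and top_p, not the library's implementation (which works on logits and may differ in edge cases); the function names and the five-token distribution are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens, then renormalise so they sum to 1."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Toy distribution with a long tail of unlikely tokens (made-up numbers).
probs = {"bright": 0.50, "uncertain": 0.30, "here": 0.15, "zebra": 0.04, "qux": 0.01}

nucleus = top_p_filter(probs, 0.9)  # drops "zebra" and "qux"
top2 = top_k_filter(probs, 2)       # keeps only "bright" and "uncertain"

print("top_p=0.9:", nucleus)
print("top_k=2:  ", top2)
```

Sampling then draws from the filtered, renormalised distribution, so the unlikely tail tokens can no longer be picked.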
Error recovery
ValueError: `do_sample=True` requires `top_k > 0` or `top_p < 1.0`
Output is garbled or nonsensical
Output is identical across runs with do_sample=True
Experienced dev note
Senior developers know that sampling quality depends heavily on the probability distribution the model learned. A poorly trained model can end up with a sharp, peaky distribution, where almost all the probability sits on one token and sampling barely differs from greedy decoding. Conversely, a well-trained model such as GPT-2 or larger tends to produce smoother distributions on open-ended prompts, where sampling creates genuinely diverse outputs. If you inherit code that turns sampling on but shows no improvement in diversity, the cause may be an undertrained model rather than your sampling settings. Also: in production, use a fixed seed when debugging ("why did this user get output X?") but do not fix the seed for live traffic, or every user gets the same variation.
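The seeding advice can be illustrated with a small standard-library sketch. The sample_reply function and its toy vocabulary are stand-ins for a sampled generation call, not part of any library. In a PyTorch/transformers stack, the analogous step would be calling torch.manual_seed or transformers.set_seed before generate().

```python
import random

def sample_reply(rng, n_tokens=5):
    """Stand-in for a sampled generation: draws n_tokens from a toy vocabulary."""
    vocab = ["bright", "uncertain", "open", "hard", "near"]
    return [rng.choice(vocab) for _ in range(n_tokens)]

# Debugging: a fixed seed replays exactly the same draws.
debug_a = sample_reply(random.Random(1234))
debug_b = sample_reply(random.Random(1234))
print(debug_a == debug_b)  # True: same seed, same sequence

# Live traffic: no fixed seed, so each request gets fresh randomness.
live_rng = random.Random()  # seeded from OS entropy
print(sample_reply(live_rng))
print(sample_reply(live_rng))
```

Logging the seed used for a request is one way to replay that request's output later without seeding every user identically.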
Check your understanding
You set do_sample=True and top_p=0.9 on your model. If you run generate() three times on the same prompt, will you get three different outputs? Why or why not?
Answer hint
A correct answer explains that yes, you will almost certainly get different outputs, because each next token is drawn at random and no seed is fixed. top_p=0.9 only restricts which tokens are eligible for sampling: the smallest set of most likely tokens whose combined probability reaches 0.9. The randomness comes from the sampling step itself, not from the filtering, which is deterministic.
Note on the ValueError listed under Error recovery: in older versions, do_sample=True without top_k or top_p arguments would raise an error. Modern versions (4.27.0+, including 5.5.x) allow it but may warn. Always set the filtering arguments explicitly for clarity and compatibility.