Code Beginner easy · 5 min

do_sample=True: enabling sampling

What you will learn

Toggle between greedy decoding (always picking the highest probability token) and sampling (randomly selecting from the probability distribution) to control output diversity.

Why this matters

By default, generate() uses greedy decoding unless the model's generation config says otherwise, and greedy output is deterministic and often repetitive. For creative or conversational tasks, you need <code>do_sample=True</code> to get varied, natural-sounding outputs. It is one of the settings with the most visible effect on how the output reads.

Skip if: Do NOT use sampling (<code>do_sample=True</code>) for structured tasks like code generation, SQL queries, or mathematical calculations where you need consistent, deterministic output. Latency is a weaker reason to avoid it: the extra work per token (a random draw, plus a sort when top_p is set) is usually small next to the model's forward pass.

Explanation

What it is: do_sample=True changes how a language model picks the next token during text generation. Instead of always choosing the token with the highest probability (greedy decoding), it randomly samples from the probability distribution.

How it works: When the model generates text, it outputs a probability distribution over all possible next tokens. With do_sample=False (default), it picks the highest-probability token every time, like always betting on the likelier side of a weighted coin. With do_sample=True, it respects the full distribution, like actually flipping that coin. A token with 70% probability gets picked about 70% of the time, one with 20% gets picked about 20% of the time. This randomness is what makes outputs diverse and natural-sounding.
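The difference can be sketched without a model at all. The snippet below is a minimal illustration using only Python's standard library; the three-token distribution and the helper names `pick_greedy` and `pick_sampled` are made up for this example and are not part of transformers.

```python
import random
from collections import Counter

# Hypothetical next-token distribution (illustrative numbers, not real model output).
next_token_probs = {"bright": 0.7, "uncertain": 0.2, "purple": 0.1}

def pick_greedy(probs):
    """Greedy decoding: always return the single most likely token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Sampling: draw one token with probability proportional to its weight."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so this demo is repeatable
draws = 10_000
greedy_picks = Counter(pick_greedy(next_token_probs) for _ in range(draws))
sampled_picks = Counter(pick_sampled(next_token_probs, rng) for _ in range(draws))

print("greedy :", dict(greedy_picks))
print("sampled:", {token: round(n / draws, 2) for token, n in sampled_picks.items()})
```

Greedy picks the same token on every draw, while the sampled frequencies land close to the 70/20/10 split of the distribution.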

When to use it: Use sampling for any task where you want creative variation: dialogue, storytelling, content generation, open-ended question answering. Keep it off for deterministic tasks where you need the same output every time.

Analogy

Imagine a chef making the same dish repeatedly. With greedy decoding, they always use their favorite ingredient for each step (highest probability), producing identical dishes every time. With sampling, they still favor their favorite ingredients but occasionally reach for alternatives, producing subtle variations that feel fresh while staying true to the original recipe.

Code

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

prompt = "The future of AI is"
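# Calling the tokenizer returns input_ids plus the attention_mask that generate() uses.
inputs = tokenizer(prompt, return_tensors="pt")

print("=== Greedy Decoding (do_sample=False) ===")
# Greedy decoding ignores top_k/top_p, so they are left out of this call.
greedy_output = model.generate(
    **inputs,
    max_length=20,  # total length in tokens, prompt included
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

print("\n=== Sampling (do_sample=True) ===")
for i in range(3):
    # top_k=0 and top_p=1.0 turn filtering off, so every token is eligible.
    sampled_output = model.generate(
        **inputs,
        max_length=20,
        do_sample=True,
        top_k=0,
        top_p=1.0,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"Sample {i+1}: {tokenizer.decode(sampled_output[0], skip_special_tokens=True)}")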

Output (illustrative; the sampled lines differ on every run)

=== Greedy Decoding (do_sample=False) ===
The future of AI is bright. We are living in a time of great technological change

=== Sampling (do_sample=True) ===
Sample 1: The future of AI is uncertain, but it will likely have a significant impact on
Sample 2: The future of AI is a topic that has been discussed extensively in the media and
Sample 3: The future of AI is full of possibilities, both positive and negative, and it will

What just happened?

The code generated text four times from the same prompt. The first call, with <code>do_sample=False</code>, used greedy decoding, which always picks the highest-probability token, so rerunning it reproduces the same text. The next three calls, with <code>do_sample=True</code>, each produced a different continuation because the model sampled from the probability distribution. The sampled outputs change from run to run while remaining plausible continuations of the prompt.

Common gotcha

Developers often think do_sample=True alone makes output 'random', but sampling still follows the learned probabilities. The real gotcha: do_sample=True with top_k=0, top_p=1.0 (no filtering) can still pick very low-probability tokens that read as odd or off-topic. Use top_p=0.95 or top_k=50 alongside sampling to cut off that tail while keeping diversity.
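To see what the two filters do to a distribution, here is a minimal sketch in plain Python. The helpers `top_k_filter` and `top_p_filter` and the five-token distribution are hypothetical and written for this lesson; they show the idea behind the parameters, not the library's actual implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Illustrative distribution with a long tail of unlikely tokens.
probs = {"bright": 0.60, "uncertain": 0.25, "here": 0.10, "zxq": 0.04, "##": 0.01}

print("top_k=2  :", list(top_k_filter(probs, 2)))
print("top_p=0.9:", list(top_p_filter(probs, 0.9)))
```

Both filters drop the unlikely tail before the random draw happens, so sampling can still vary among the remaining tokens but can no longer land on the outliers.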

Error recovery

ValueError: `do_sample=True` requires `top_k > 0` or `top_p < 1.0`

This error is associated with some older API versions when sampling is enabled without probability filtering; current transformers releases accept <code>do_sample=True</code> on its own. Fix: add <code>top_p=0.95</code> or <code>top_k=50</code> to your generate() call.

Output is garbled or nonsensical

Sampling without filtering (top_k=0, top_p=1.0) can pick extremely low-probability tokens. Fix: set <code>top_p=0.9</code>, which keeps only the most likely tokens whose combined probability reaches 90% and drops the long tail.

Output is identical across runs with do_sample=True

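With sampling on, identical output usually means the random state is the same for every call, for example because <code>torch.manual_seed(42)</code> runs right before each generate(). It can also mean the model's probability distribution has very low entropy, so one token dominates at almost every step. Fix: when you want variety, seed once at the start of the script or not at all; call <code>torch.manual_seed(42)</code> before generate() only when you want sampling to be reproducible.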
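The effect of where you seed can be shown with a small stand-in. This sketch uses Python's `random.Random` in place of PyTorch's generator, which is an assumption made so the example runs without a model; the token list, weights, and `sample_sequence` helper are illustrative only.

```python
import random

tokens = ["bright", "uncertain", "complicated"]
weights = [0.5, 0.3, 0.2]  # illustrative probabilities, not model output

def sample_sequence(rng, length=20):
    """Draw `length` tokens from the toy distribution using the given generator."""
    return [rng.choices(tokens, weights=weights, k=1)[0] for _ in range(length)]

# Re-seeding before every call: each "run" starts from the same random state,
# so all three sequences are identical.
reseeded = [sample_sequence(random.Random(42)) for _ in range(3)]

# Seeding once and reusing the generator: the sequences differ from each other,
# yet rerunning the whole script reproduces them.
shared_rng = random.Random(42)
seeded_once = [sample_sequence(shared_rng) for _ in range(3)]

print("re-seeded runs identical:", reseeded[0] == reseeded[1] == reseeded[2])
print("seeded-once runs identical:", seeded_once[0] == seeded_once[1] == seeded_once[2])
```

The same pattern applies to `torch.manual_seed`: placing it inside the loop pins every call to the same draw, while placing it once at the top keeps the script reproducible without making the samples identical.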

Experienced dev note

Sampling quality depends heavily on the probability distribution the model learned. When that distribution is sharp and peaky, sampling barely differs from greedy decoding; when it spreads probability across several plausible tokens, sampling produces genuinely diverse outputs. If you inherit code that turns sampling on but shows no improvement in diversity, the cause may be the model (for example, undertraining) rather than your sampling settings. Also, in production, record or fix the seed when debugging ('why did this user get output X?'), but do NOT reuse one fixed seed for live traffic, or every user who sends the same prompt gets the same output.
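One way to check whether a distribution leaves room for diversity is to measure its entropy. The sketch below is a rough illustration with two made-up four-token distributions; `entropy_bits` is a hypothetical helper, and in practice you would compute the same quantity from the softmax of the model's logits.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; higher means more spread-out probability."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative next-token distributions (not real model output).
peaked = [0.97, 0.01, 0.01, 0.01]  # one token dominates
flat = [0.25, 0.25, 0.25, 0.25]    # four equally likely tokens

for name, probs in [("peaked", peaked), ("flat", flat)]:
    # max(probs) is the chance that a sampled token matches the greedy choice.
    print(f"{name}: entropy={entropy_bits(probs):.2f} bits, "
          f"P(sample == greedy)={max(probs):.2f}")
```

With the peaked distribution, a sampled token matches the greedy choice 97% of the time, so turning sampling on changes very little; with the flat one, it matches only a quarter of the time.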

Check your understanding

You set do_sample=True and top_p=0.9 on your model. If you run generate() three times on the same prompt, will you get three different outputs? Why or why not?

Answer hint

A correct answer explains that you will almost certainly get different outputs, because each token is drawn at random, and that <code>top_p=0.9</code> only limits which tokens are eligible for sampling (the most likely tokens that together cover 90% of the probability mass). The randomness comes from the sampling step itself, not from the filtering. Identical outputs are possible only if the same seed is reset before each call or the distribution is extremely peaked.

VERSION: In transformers < 4.27.0, do_sample=True without top_k or top_p arguments would reportedly raise an error. Modern versions (4.27.0+, including 5.5.x) allow it but may warn, and any argument you omit falls back to the model's generation config, which may already apply filtering. Always explicitly set filtering arguments for clarity and compatibility.

Now that you can enable sampling, learn how <code>temperature</code> controls how 'confident' or 'exploratory' the sampling distribution becomes, making the model more conservative or more creative.
