API Beginner easy · 5 min

top_p: nucleus sampling

What you will learn
Control output diversity by sampling from the smallest set of most likely tokens whose probabilities sum to at least a chosen threshold, instead of sampling from the full vocabulary.

Why this matters

Using the right sampling strategy prevents robotic repetition while keeping responses coherent. top_p is often more intuitive than temperature for controlling creativity without manual tuning.

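Skip if: Use temperature alone if you want direct control over how sharp or flat the probability distribution is. Use top_k, on APIs that expose it, if you want a fixed number of candidate tokens. Leave top_p at 1.0 (the default) to turn nucleus filtering off; this does not make output deterministic. For near-greedy, more reproducible output, lower temperature toward 0 instead.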

Explanation

top_p implements nucleus sampling: a technique where only the most likely tokens, up to a cumulative probability threshold you set, are eligible for the next step. For example, if top_p=0.9, tokens are taken in descending probability order until their probabilities sum to at least 90%; that set is the nucleus. The next token is then sampled from the nucleus in proportion to each token's probability (renormalized to sum to 1), not uniformly.

Conceptually, the sampler sorts all possible next tokens by probability, walks down the list adding probabilities together, and stops once the running total reaches your threshold. The token is then drawn at random from that nucleus, weighted by probability. This differs from temperature, which reshapes the whole probability distribution: top_p is a dynamic cutoff that adapts to the model's confidence. When the model is confident, the nucleus may hold one or two tokens; when it is unsure, it may hold dozens.
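
The walk described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the API's actual implementation: the function names nucleus and sample_top_p and the toy distribution are made up for this example, and a real model works over tens of thousands of tokens.

```python
import random


def nucleus(probs, top_p):
    """Return the renormalized nucleus for a token -> probability mapping."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Most likely tokens first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:  # threshold reached: stop adding tokens
            break
    total = sum(p for _, p in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / total for token, p in kept}


def sample_top_p(probs, top_p, rng=random):
    """Draw one token from the nucleus, weighted by renormalized probability."""
    nuc = nucleus(probs, top_p)
    return rng.choices(list(nuc), weights=list(nuc.values()), k=1)[0]


# Hypothetical next-token distribution, for illustration only.
toy = {"rain": 0.5, "storm": 0.3, "drizzle": 0.15, "taco": 0.05}
print(nucleus(toy, 0.9))       # "taco" falls outside the nucleus
print(sample_top_p(toy, 0.9))  # one of the three remaining tokens
```

With top_p=0.9 the first three tokens already cover 95% of the probability mass, so the unlikely fourth token can never be chosen.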

Use top_p for conversational or creative tasks where you want coherence (close to 0.9) or more exploration (0.95+). Use smaller values (0.3–0.7) only if you need highly focused, narrow responses and are willing to lose semantic variety.

Request code

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

response = client.chat.completions.create(
    model='gpt-4-turbo',
    messages=[
        {
            'role': 'user',
            'content': 'Write a short poem about a rainy day.'
        }
    ],
    top_p=0.9,
    temperature=0.7,
    max_tokens=150
)

print(response.choices[0].message.content)

Authentication

Set the OPENAI_API_KEY environment variable before running your script, or pass it explicitly to OpenAI(api_key='sk-...').

Response shape

| Field | Description |
| --- | --- |
| choices | List of completion objects; first element [0] contains the generated message |
| choices[0].message.content | The actual text generated by the model |
| choices[0].finish_reason | String: 'stop' if completed normally, 'length' if max_tokens hit |
| usage.prompt_tokens | Integer count of tokens in your prompt |
| usage.completion_tokens | Integer count of tokens in the response |
| model | String confirming which model version was used |

Field guide

choices[0].message.content

This is the text you show to the user; call .strip() to remove leading and trailing whitespace before displaying it.

finish_reason

If 'length', your response was cut off: raise max_tokens and resend to get the complete answer. Developers often miss this and assume the model 'finished early' when it actually hit a limit.

usage

Track this across your API calls to forecast costs: multiply prompt_tokens by the input price and completion_tokens by the output price (the two are usually priced differently), then add the results.
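
The three fields above can be read together in one small helper. This is a sketch under stated assumptions: summarize is a hypothetical name, the per-1K prices are placeholders rather than real rates, and the fake object only mimics the attribute layout of a chat completion so the example runs without a network call.

```python
from types import SimpleNamespace


def summarize(response, input_price_per_1k, output_price_per_1k):
    """Pull out display text, a truncation flag, and an estimated cost."""
    choice = response.choices[0]
    text = (choice.message.content or "").strip()
    truncated = choice.finish_reason == "length"  # hit max_tokens
    usage = response.usage
    cost = (usage.prompt_tokens / 1000 * input_price_per_1k
            + usage.completion_tokens / 1000 * output_price_per_1k)
    return {"text": text, "truncated": truncated, "estimated_cost": cost}


# Stand-in for a real response object (assumption: same attribute names).
fake = SimpleNamespace(
    choices=[SimpleNamespace(
        message=SimpleNamespace(content="  Rain taps the glass \n"),
        finish_reason="length",
    )],
    usage=SimpleNamespace(prompt_tokens=1200, completion_tokens=150),
)

# Placeholder prices; look up the current rates for your model.
report = summarize(fake, input_price_per_1k=0.01, output_price_per_1k=0.03)
print(report)
```

If truncated comes back True, retry with a larger max_tokens before showing the text as a finished answer.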

Setup trap

The SDK reads OPENAI_API_KEY when the client is constructed, not when openai is imported, so setting os.environ['OPENAI_API_KEY'] in your script after the import but before calling OpenAI() works. If neither the variable nor an api_key argument is available at that point, the 1.x SDK raises an error from the constructor instead of creating a client. A client that already exists keeps the key it was built with: later os.environ assignments do not change it. Always set the API key before creating the client object.
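
One way to make this failure obvious is to check for the key yourself before constructing the client. The helper below is a hypothetical convenience, not part of the openai SDK; it uses only the standard library.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before creating the client"
        )
    return key


# Intended use (requires the openai package, so it is left as a comment):
#   from openai import OpenAI
#   client = OpenAI(api_key=require_api_key())
```

Passing a plain dict as env makes the check easy to exercise in tests without touching the real environment.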

Cost

top_p itself has no cost penalty: you pay per token, not per parameter. It does not set response length (max_tokens and the model's own stopping point do), but a different sampling setting can lead to longer or shorter outputs. Monitor usage.completion_tokens after changing top_p to spot unintended expansion.

Rate limits

No special rate-limit behavior for top_p; it's just a parameter. Your rate limit depends on your account's usage tier and the model you call; check the limits page in your OpenAI dashboard for the current values.

Common gotcha

Setting top_p=0.9 and temperature=2.0 together works against itself: temperature=2.0 flattens the distribution so much that the 90% threshold takes in far more tokens, including unlikely ones, and the nucleus no longer filters much. Keep temperature ≤1.0 when using top_p, or adjust one sampling parameter at a time.
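
The effect is easy to see on a toy example. The sketch below assumes temperature is applied to the logits before the top_p cutoff, which is the usual ordering but is not confirmed for any particular API; the logits are invented for illustration and softmax and nucleus_size are hypothetical helper names.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, top_p):
    """Count how many of the most likely tokens are needed to reach top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)


# Made-up logits for eight candidate tokens.
logits = [5.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]
for t in (0.7, 1.0, 2.0):
    size = nucleus_size(softmax(logits, t), top_p=0.9)
    print(f"temperature={t}: nucleus holds {size} of {len(logits)} tokens")
```

On these logits the nucleus grows as temperature rises, which is the interaction described above: the same top_p admits more candidates once the distribution has been flattened.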

Error recovery

openai.BadRequestError: invalid_request_error - extra inputs
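The request contained a parameter the API does not accept. With the 1.x Python SDK, a misspelled keyword argument (e.g., 'top_p_' instead of 'top_p') is usually rejected earlier, as a Python TypeError for an unexpected keyword argument. Double-check the spelling of every argument to create() and of the keys in each message.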
openai.AuthenticationError: invalid_api_key
Your OPENAI_API_KEY is missing, invalid, or expired. Run `echo $OPENAI_API_KEY` to verify it's set, and regenerate the key in your OpenAI dashboard if needed.
TypeError: unsupported operand type(s)
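Likely you're treating top_p as a string somewhere in your own code, for example doing arithmetic on a value read from a config file or environment variable. Convert it with float() and pass top_p=0.9, not top_p='0.9'.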
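
Several of these errors can be caught before the request is sent by validating top_p locally. The helper below is a hypothetical sketch, not part of the openai SDK, and it assumes the accepted range is 0 to 1 inclusive.

```python
def coerce_top_p(value):
    """Convert value to a float and check it lies in the assumed 0..1 range."""
    try:
        top_p = float(value)
    except (TypeError, ValueError):
        raise ValueError(f"top_p must be a number, got {value!r}") from None
    if not 0.0 <= top_p <= 1.0:  # also rejects NaN
        raise ValueError(f"top_p must be between 0 and 1, got {top_p}")
    return top_p


print(coerce_top_p("0.9"))  # a string from a config file becomes a float
```

Call it on any value that comes from user input or configuration, then pass the result as top_p.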

Experienced dev note

Most developers tune only temperature and forget top_p exists, then complain the model is 'too creative' even at temperature=1.0. In reality, they should lower top_p to 0.7–0.8 first before touching temperature. top_p is more predictable across models and use cases because it's adaptive: it depends on the model's actual probability distribution, not a fixed scaling factor. For production systems, set top_p=0.9–0.95 and temperature=0.7, then iterate only on top_p if you need more or less variety. This saves tuning time and makes your prompts more transferable across model updates.

Check your understanding

You send two otherwise identical requests, one with top_p=0.5 and one with top_p=0.9. Which one will produce the more coherent (less chaotic) response, and why is temperature alone not enough to explain the difference?

Answer hint

top_p=0.5 is more restrictive: the nucleus holds fewer candidate tokens, so responses tend to be more coherent. Temperature reshapes the probabilities but never removes a token from consideration. Nucleus sampling cuts off the low-probability tail outright, which is a different kind of control.

VERSION openai 1.x SDK (current stable). Do not use deprecated openai.ChatCompletion.create() syntax; it will fail. Always use client.chat.completions.create().
