
How thinking tokens affect model performance

Quick answer
Thinking tokens are the tokens a model spends on intermediate reasoning before producing its final answer, whether elicited by a prompt or generated natively. Allocating more tokens to "thinking" lets models like gpt-4o or claude-3-5-sonnet-20241022 improve reasoning accuracy and reduce errors, but an excessive thinking budget increases latency and token usage costs.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the requirement so the shell does not treat >= as a redirection)

Understanding thinking tokens

Thinking tokens are the tokens a large language model (LLM) devotes to intermediate reasoning steps before producing a final answer. They act like mental scratch paper: the model breaks a complex problem into smaller parts in its output before committing to a result. The technique is most often implemented as chain-of-thought prompting, where the model is encouraged to "think aloud" by generating reasoning tokens ahead of the final answer.

These extra tokens do not enlarge the model's context window; they spend more of the output budget on reasoning, which lets the model handle multi-step logic, arithmetic, or code generation more accurately.

Step by step example with OpenAI SDK

This example shows how to use thinking tokens by prompting gpt-4o to solve a math problem with chain-of-thought reasoning.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [
    {"role": "user", "content": "Q: If I have 3 apples and buy 2 more, how many apples do I have? Think step-by-step."}
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=100  # output cap; raise this if the reasoning gets cut off
)

print(response.choices[0].message.content)
output
Step 1: I start with 3 apples.
Step 2: I buy 2 more apples.
Step 3: Total apples = 3 + 2 = 5.
Answer: You have 5 apples.
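Because the reply interleaves reasoning steps with the final answer, it is common to parse out just the answer line. A minimal sketch (the "Answer:" convention matches the prompt style above; this is a hypothetical helper, not an SDK feature):

```python
def extract_answer(reply: str, marker: str = "Answer:") -> str:
    """Return the text after the last answer marker, or the whole
    reply if the model never emitted one."""
    for line in reversed(reply.strip().splitlines()):
        if line.strip().startswith(marker):
            return line.strip()[len(marker):].strip()
    return reply.strip()

reply = (
    "Step 1: I start with 3 apples.\n"
    "Step 2: I buy 2 more apples.\n"
    "Step 3: Total apples = 3 + 2 = 5.\n"
    "Answer: You have 5 apples."
)
print(extract_answer(reply))  # You have 5 apples.
```

Scanning from the bottom is deliberate: the reasoning steps may themselves contain the marker word, but the final answer is the last occurrence.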

Common variations and trade-offs

Using more thinking tokens (a longer chain of thought) generally improves reasoning accuracy but increases both latency and token cost. Stronger models such as claude-3-5-sonnet-20241022 tend to reach a correct answer with fewer reasoning tokens than smaller models.
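The cost side of the trade-off is easy to quantify, because thinking tokens are billed as output tokens. A back-of-the-envelope helper (the per-token prices below are placeholders; check your provider's current pricing):

```python
def reasoning_cost(prompt_tokens: int, thinking_tokens: int, answer_tokens: int,
                   usd_per_input_token: float = 2.5e-6,
                   usd_per_output_token: float = 1e-5) -> float:
    """Estimate request cost in USD. Thinking tokens count toward
    output, so a longer chain-of-thought raises cost linearly."""
    output_tokens = thinking_tokens + answer_tokens
    return (prompt_tokens * usd_per_input_token
            + output_tokens * usd_per_output_token)

# Doubling the thinking budget roughly doubles the output bill:
baseline = reasoning_cost(prompt_tokens=50, thinking_tokens=100, answer_tokens=20)
doubled = reasoning_cost(prompt_tokens=50, thinking_tokens=200, answer_tokens=20)
```

Since output tokens are typically several times more expensive than input tokens, the reasoning budget usually dominates the bill for chain-of-thought workloads.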

Variations include:

  • Raising max_tokens to leave room for longer reasoning chains.
  • Adding explicit instructions such as "Let's think step-by-step" to elicit reasoning.
  • Streaming the response to see reasoning tokens as they are generated.
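Some Claude models also support a native extended-thinking mode, where the reasoning budget is set per request rather than prompted for. A sketch of the request payload, based on Anthropic's Messages API (field names follow their extended-thinking documentation; verify against the current API reference):

```python
def build_thinking_request(prompt: str, budget_tokens: int = 2048,
                           max_tokens: int = 4096) -> dict:
    """Build a Messages API payload with extended thinking enabled.
    The thinking budget must fit inside the overall output cap."""
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_thinking_request(
    "If I have 3 apples and buy 2 more, how many apples do I have?",
    budget_tokens=1024,
)
```

The payload could then be sent with the Anthropic SDK, e.g. anthropic.Anthropic().messages.create(**payload), which returns thinking blocks alongside the text answer.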

Troubleshooting thinking tokens

If the model output is cut off or incomplete, increase max_tokens to allow more thinking tokens.
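Truncation is detectable programmatically: the OpenAI response sets finish_reason to "length" when the max_tokens cap was hit. A retry sketch, where call_model is a stand-in for whatever function actually issues the request:

```python
def complete_with_retry(call_model, max_tokens: int = 100, retries: int = 2):
    """Call the model, doubling max_tokens whenever the output was
    cut off (finish_reason == "length")."""
    for _ in range(retries + 1):
        text, finish_reason = call_model(max_tokens)
        if finish_reason != "length":
            return text
        max_tokens *= 2
    return text  # still truncated after all retries

# Simulated model: truncates any request allowing fewer than 200 tokens.
def fake_model(max_tokens):
    if max_tokens < 200:
        return "partial", "length"
    return "full", "stop"

print(complete_with_retry(fake_model))  # full
```

With a real client, call_model would wrap client.chat.completions.create and return response.choices[0].message.content together with response.choices[0].finish_reason.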

If reasoning is shallow or incorrect, try more explicit chain-of-thought prompts or switch to a stronger reasoning model like claude-sonnet-4-5.

High latency often comes from very long reasoning chains; balance the thinking-token budget against your response-time requirements.

Key Takeaways

  • Thinking tokens let LLMs perform multi-step reasoning by spending output tokens on intermediate steps.
  • More thinking tokens improve accuracy but increase latency and token usage costs.
  • Explicit chain-of-thought prompts elicit better step-by-step reasoning from models like gpt-4o and claude-3-5-sonnet-20241022.
  • Adjust max_tokens to balance reasoning depth and performance.
  • Use stronger reasoning models for complex tasks requiring extensive thinking tokens.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022, claude-sonnet-4-5