Cost of thinking tokens
Why this matters
Extended thinking allows Claude to reason through complex problems before responding, but developers must understand the billing structure to budget API calls correctly and avoid surprise costs in production.
Explanation
Extended thinking lets Claude produce reasoning before its final answer. When you enable it with budget_tokens, that value caps how many tokens the model may spend thinking before it writes the visible response; it must be smaller than max_tokens, because thinking and the response share the max_tokens limit. Thinking tokens are billed as output tokens at the standard output rate: there is no thinking-specific multiplier, and usage.output_tokens covers both the thinking and the final text. The response's content array returns the reasoning as thinking blocks alongside the text blocks; on some models the returned thinking is a summary, while billing reflects the full thinking generated. Use extended thinking when step-by-step reasoning improves answer quality: complex problems, debugging code, or ambiguous requirements where the model needs to weigh multiple interpretations before responding.
Request code
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=8000,  # must be larger than budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 5000,
    },
    messages=[
        {
            "role": "user",
            "content": "I have 5 red balls, 3 blue balls, and 2 green balls in a bag. If I draw without replacement, what's the probability of drawing 2 red balls in a row?",
        }
    ],
)

print(f"Stop reason: {response.stop_reason}")
print(f"Input tokens: {response.usage.input_tokens}")
# output_tokens includes the thinking tokens as well as the visible response.
print(f"Output tokens (thinking + response): {response.usage.output_tokens}")

for block in response.content:
    if block.type == "thinking":
        print(f"\n[Thinking block - {len(block.thinking)} chars]")
    elif block.type == "text":
        print(f"\nResponse:\n{block.text}")

Authentication
Set your Anthropic API key as an environment variable before running. The SDK reads it automatically: export ANTHROPIC_API_KEY='sk-ant-...'. No explicit authentication code is required: the Anthropic client constructor handles it.
Response shape
| Field | Description |
|---|---|
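| stop_reason | string: 'end_turn', 'max_tokens', or 'stop_sequence' |
| usage | object: token counters such as input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens |
| content | array: content blocks of type 'thinking' and 'text' |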
Field guide
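usage.output_tokens The number of output tokens generated, including thinking tokens. This is the counter that reflects how much of your budget_tokens allocation the model actually used, combined with the tokens of the visible response.
usage.cache_creation_input_tokens The number of input tokens written to the prompt cache. It relates to prompt caching only and does not track thinking tokens.
content Array containing both thinking blocks (type='thinking') and text blocks (type='text'). Thinking blocks are returned in the response; whether to show them to end users is your application's choice. On some models the thinking text is a summary, so its length is not a reliable measure of the thinking tokens billed.
usage.cache_read_input_tokens The number of input tokens read from the prompt cache. Cache reads are billed at 10% of the base input rate (a 90% discount). This applies to the cached prompt prefix, not to newly generated thinking.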
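A minimal sketch of collecting the usage counters into a plain dict for logging. The helper name summarize_usage is hypothetical, and the SimpleNamespace object is only a stand-in for response.usage with made-up numbers; the sketch assumes the usage object exposes the four attributes named in the code and that thinking tokens are counted inside output_tokens.

```python
from types import SimpleNamespace


def summarize_usage(usage):
    """Return the token counters of a usage object as a dict (hypothetical helper).

    Counters that are missing or None are reported as 0.
    """
    fields = (
        "input_tokens",
        "output_tokens",                # assumed to include thinking tokens
        "cache_creation_input_tokens",  # prompt cache writes
        "cache_read_input_tokens",      # prompt cache reads
    )
    return {name: getattr(usage, name, None) or 0 for name in fields}


# Stand-in for response.usage; the numbers are invented for illustration.
example_usage = SimpleNamespace(
    input_tokens=2000,
    output_tokens=5500,
    cache_creation_input_tokens=None,
    cache_read_input_tokens=0,
)
summary = summarize_usage(example_usage)
print(summary)
```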
Setup trap
The thinking parameter requires budget_tokens to be set: passing thinking={"type": "enabled"} without budget_tokens will raise a validation error. The minimum budget is 1024 tokens, and budget_tokens must be smaller than max_tokens. Because thinking and the visible response share the max_tokens limit, a max_tokens value that leaves little room beyond the thinking budget can end the response with the max_tokens stop reason before the answer is complete.
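The constraints above can be checked locally before sending a request. The following is a sketch only: check_thinking_config is a hypothetical helper, not part of the SDK, and it assumes a 1024-token minimum and the rule that budget_tokens must be smaller than max_tokens.

```python
MIN_THINKING_BUDGET = 1024  # assumed minimum budget, as stated in the text


def check_thinking_config(max_tokens, budget_tokens):
    """Return a list of problems with a thinking config (hypothetical helper).

    An empty list means the values pass these local checks; the API
    remains the final authority on what it accepts.
    """
    if budget_tokens is None:
        return ["budget_tokens is required when thinking is enabled"]

    problems = []
    if budget_tokens < MIN_THINKING_BUDGET:
        problems.append(
            f"budget_tokens ({budget_tokens}) is below the minimum of {MIN_THINKING_BUDGET}"
        )
    if budget_tokens >= max_tokens:
        problems.append(
            f"budget_tokens ({budget_tokens}) must be smaller than max_tokens ({max_tokens})"
        )
    return problems


ok = check_thinking_config(max_tokens=8000, budget_tokens=5000)
too_small = check_thinking_config(max_tokens=8000, budget_tokens=512)
too_large = check_thinking_config(max_tokens=4000, budget_tokens=5000)
print(ok, too_small, too_large, sep="\n")
```

Running a check like this in tests or at startup surfaces configuration mistakes without spending an API call.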
Cost
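Thinking tokens are billed at the output rate, with no separate multiplier. Taking illustrative rates of $3/MTok for input and $15/MTok for output (check the current pricing page for the actual claude-opus-4-6 rates, which vary by model and change over time): a single request with 2000 input tokens, 5000 thinking tokens, and 500 response tokens costs (2000*$3 + (5000 + 500)*$15) / 1M = $0.0885. The same request without thinking costs (2000*$3 + 500*$15) / 1M = $0.0135, so thinking makes this request about 6.6x more expensive. The multiplier grows with the thinking actually used and shrinks as the prompt gets longer, so budget per use case rather than assuming one fixed ratio.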
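The arithmetic can be sketched as a small estimator. It assumes thinking tokens are billed at the output rate, as Anthropic's documentation describes, and it uses the $3/MTok input and $15/MTok output figures from this section purely as illustrative inputs; estimate_cost_usd is a hypothetical helper name.

```python
def estimate_cost_usd(input_tokens, thinking_tokens, response_tokens,
                      input_rate_per_mtok, output_rate_per_mtok):
    """Estimate the cost of one request in USD (hypothetical helper).

    Assumes thinking tokens are billed as output tokens. Rates are in
    USD per million tokens and must come from the current pricing page.
    """
    output_tokens = thinking_tokens + response_tokens
    total = (input_tokens * input_rate_per_mtok
             + output_tokens * output_rate_per_mtok)
    return total / 1_000_000


# Illustrative rates only; not a statement of current prices.
with_thinking = estimate_cost_usd(2000, 5000, 500, 3.0, 15.0)
without_thinking = estimate_cost_usd(2000, 0, 500, 3.0, 15.0)

print(f"With thinking:    ${with_thinking:.4f}")
print(f"Without thinking: ${without_thinking:.4f}")
print(f"Ratio:            {with_thinking / without_thinking:.1f}x")
```

Passing the rates in as parameters keeps the estimator valid when prices or models change.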
Rate limits
Extended thinking requests consume quota faster because each one generates more tokens. If you hit rate limits, check both tokens-per-minute and requests-per-minute limits: a thinking-enabled workload can exhaust its token quota while sending fewer requests than before.
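A rough way to see the effect is to estimate how many requests fit in a minute under both kinds of limit. This is a back-of-the-envelope sketch: the limit values below are invented for illustration, real limits depend on your account, and max_requests_per_minute is a hypothetical helper.

```python
def max_requests_per_minute(tokens_per_minute_limit, tokens_per_request,
                            requests_per_minute_limit):
    """Estimate sustainable requests per minute (hypothetical helper).

    Returns whichever is lower: the request limit itself, or the number
    of requests that fit inside the token limit.
    """
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    by_tokens = tokens_per_minute_limit // tokens_per_request
    return min(by_tokens, requests_per_minute_limit)


# Invented limits for illustration: 80,000 tokens/min and 50 requests/min.
plain = max_requests_per_minute(80_000, 500, 50)      # 500 tokens per request
thinking = max_requests_per_minute(80_000, 5_500, 50)  # 5,000 thinking + 500 response

print(f"Without thinking: {plain} requests/min")
print(f"With thinking:    {thinking} requests/min")
```

In this example the request limit is the bottleneck without thinking, while the token limit becomes the bottleneck once thinking is enabled.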
Common gotcha
Developers sometimes assume thinking tokens are reported in cache_creation_input_tokens or billed at a special multiplier. Neither is the case: cache_creation_input_tokens counts prompt cache writes only, and thinking tokens are included in output_tokens and billed at the standard output rate. A request with 1000 input tokens, 5000 thinking tokens, and 500 response tokens is billed as 1000 input tokens plus 5500 output tokens.
Error recovery
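BadRequestError (HTTP 400, invalid_request_error): budget_tokens is below the minimum, or thinking is enabled without budget_tokens. Fix the request; retrying it unchanged will fail again.
OverloadedError (HTTP 529, overloaded_error): the API is temporarily overloaded. Retry with exponential backoff.
Experienced dev note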
Senior developers often enable thinking globally to 'improve quality' without measuring ROI. Track the cost of thinking-enabled requests separately in your logging: record whether thinking was enabled and add a cost_cents field to request logs, for example output_cost_cents = usage.output_tokens * 15 / 1_000_000 * 100 at an assumed $15/MTok output rate. Because output_tokens includes thinking, comparing this figure between thinking-enabled and thinking-disabled requests reveals which use cases justify the extra spend. You may find that thinking helps on a minority of queries while accounting for most of your budget. Use per-request feature flags or conditional logic based on query complexity to enable thinking only when it matters.
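One way to enable thinking per request is to build the request arguments conditionally. This is a sketch under stated assumptions: build_request_kwargs and looks_complex are hypothetical helpers, and the keyword heuristic is a toy placeholder for whatever complexity signal your application actually has.

```python
def looks_complex(prompt):
    """Toy complexity heuristic (hypothetical): long prompts or trigger words."""
    keywords = ("prove", "debug", "probability", "step by step")
    lowered = prompt.lower()
    return len(prompt) > 400 or any(word in lowered for word in keywords)


def build_request_kwargs(prompt, use_thinking, max_tokens=8000, budget_tokens=5000):
    """Build keyword arguments for client.messages.create (hypothetical helper).

    The thinking parameter is included only when use_thinking is True.
    """
    kwargs = {
        "model": "claude-opus-4-6",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    if use_thinking:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    return kwargs


simple_prompt = "What's the capital of France?"
hard_prompt = "What's the probability of drawing 2 red balls in a row?"

simple_kwargs = build_request_kwargs(simple_prompt, looks_complex(simple_prompt))
hard_kwargs = build_request_kwargs(hard_prompt, looks_complex(hard_prompt))

print("thinking" in simple_kwargs, "thinking" in hard_kwargs)
```

The resulting dict would be passed as client.messages.create(**kwargs); logging the use_thinking flag alongside the usage counters makes the later cost comparison straightforward.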
Check your understanding
If a user makes two requests with identical prompts and prompt caching enabled, the second request shows cache_read_input_tokens > 0. How does this interact with thinking token costs, and does caching reduce what you pay for thinking on the second request?
Show answer hint
Cache reads apply to the cached prompt prefix and are billed at 10% of the base input rate (a 90% discount). Thinking is generated fresh on each request and billed as output tokens at the full output rate. Prompt caching therefore lowers the input side of the bill on the second request but does not reduce the cost of the thinking itself.