API Intermediate medium · 6 min

Cost of thinking tokens

What you will learn

Extended thinking tokens are billed as output tokens and are counted in usage.output_tokens, so a thinking budget can raise per-request cost substantially and calls for cost-aware API integration.

Why this matters

Extended thinking allows Claude to reason through complex problems before responding, but developers must understand the billing structure to budget API calls correctly and avoid surprise costs in production.

Skip if: Your workload is simple queries, fast-response requirements, or tasks where reasoning overhead doesn't add value. For those, call messages.create() without the thinking parameter.

Explanation

Extended thinking tokens represent the reasoning Claude produces before its visible answer. When you enable extended thinking with budget_tokens, that value sets how many tokens the model may spend thinking through a problem before generating the response. The API returns thinking blocks followed by text blocks, and both are counted in usage.output_tokens and billed at the standard output rate; there is no separate thinking rate. On models that return summarized thinking, the thinking text you see can be shorter than the thinking tokens you are billed for. Use extended thinking when you're solving complex problems where step-by-step reasoning improves answer quality, debugging code, or handling ambiguous requirements where the model needs to reason through multiple interpretations before responding.
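One way to keep thinking opt-in per request is to build the request arguments in a small helper and attach the thinking parameter only when asked. The sketch below is illustrative: build_request_kwargs is a hypothetical helper, not part of the anthropic SDK, and its defaults simply mirror the request code in this lesson.

```python
def build_request_kwargs(prompt, use_thinking, model="claude-opus-4-6",
                         max_tokens=8000, budget_tokens=5000):
    """Build keyword arguments for client.messages.create().

    Hypothetical helper; the defaults mirror the request code in this lesson.
    """
    kwargs = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    if use_thinking:
        # Only attach the thinking parameter when the caller opts in.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    return kwargs


simple = build_request_kwargs("What is the capital of France?", use_thinking=False)
complex_task = build_request_kwargs("Find the bug in this function.", use_thinking=True)
print("thinking" in simple, "thinking" in complex_task)
```

The resulting dict can be passed as client.messages.create(**kwargs), so the decision to pay for thinking lives in one place.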

Request code

python

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=8000,
    thinking={
        "type": "enabled",
        "budget_tokens": 5000
    },
    messages=[
        {
            "role": "user",
            "content": "I have 5 red balls, 3 blue balls, and 2 green balls in a bag. If I draw without replacement, what's the probability of drawing 2 red balls in a row?"
        }
    ]
)

print(f"Stop reason: {response.stop_reason}")
print(f"Input tokens: {response.usage.input_tokens}")
# Thinking tokens are billed as output tokens and are included in this count.
print(f"Output tokens (thinking + visible text): {response.usage.output_tokens}")

for block in response.content:
    if block.type == "thinking":
        print(f"\n[Internal reasoning - {len(block.thinking)} chars]")
    elif block.type == "text":
        print(f"\nResponse:\n{block.text}")

Authentication

Set your Anthropic API key as an environment variable before running. The SDK reads it automatically: export ANTHROPIC_API_KEY='sk-ant-...'. No explicit authentication code is required: the Anthropic client constructor handles it.

Response shape

Field	Description
`stop_reason`	string - 'end_turn' or 'max_tokens' or 'stop_sequence'
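`usage`	object - token counts, including input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens
`content`	array - content blocks, such as thinking blocks (type='thinking') and text blocks (type='text')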

Field guide

usage.cache_creation_input_tokens

Counts input tokens written to the prompt cache when prompt caching is in use. It does not track thinking tokens. Thinking tokens are counted in usage.output_tokens together with the visible response text.

content

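Array containing both thinking blocks (type='thinking') and text blocks (type='text'). Thinking blocks are returned in the response alongside the answer; whether to show them to end users is your decision. On models that return summarized thinking, the block holds a summary while billing reflects the full thinking tokens.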

cache_read_input_tokens

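Counts input tokens read from the prompt cache in a follow-up request with prompt caching enabled. These cached input tokens are billed at a lower cost (90% discount). The discount applies to cached prompt content; thinking generated for the new response is still billed as output.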
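To make the content field concrete, here is a minimal sketch that separates thinking text from visible text, for example to log one and display the other. It assumes each block exposes .type plus .thinking or .text, as in the request code above; split_content and the stand-in blocks are hypothetical and exist only so the example runs without an API call.

```python
from types import SimpleNamespace


def split_content(content):
    """Separate thinking text from visible text in a list of content blocks.

    Each block is assumed to expose .type plus .thinking or .text.
    Other block types are ignored.
    """
    thinking_parts = []
    text_parts = []
    for block in content:
        if block.type == "thinking":
            thinking_parts.append(block.thinking)
        elif block.type == "text":
            text_parts.append(block.text)
    return "\n".join(thinking_parts), "\n".join(text_parts)


# Stand-in blocks (made up for illustration) shaped like response.content.
sample_content = [
    SimpleNamespace(type="thinking", thinking="P(two reds) = 5/10 * 4/9"),
    SimpleNamespace(type="text", text="The probability is 2/9."),
]
reasoning, answer = split_content(sample_content)
print(answer)
```

With a real response, the same function can be called as split_content(response.content).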

Setup trap

The thinking parameter requires budget_tokens to be set: passing thinking={"type": "enabled"} without budget_tokens will raise a validation error. The minimum budget is 1024 tokens, and budget_tokens must be less than max_tokens. Thinking counts toward max_tokens, so if max_tokens leaves too little room after the thinking budget, the response can be cut off with a max_tokens stop reason before the visible answer is complete.
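A client-side check can catch a bad budget before the request is sent. The sketch below assumes the two constraints are a 1024-token minimum and a budget smaller than max_tokens; build_thinking_param is a hypothetical helper, not an SDK function.

```python
MIN_THINKING_BUDGET = 1024  # minimum budget_tokens cited in this lesson


def build_thinking_param(budget_tokens, max_tokens):
    """Return a thinking parameter dict, or raise ValueError if the budget is unusable.

    Hypothetical helper, not part of the anthropic SDK.
    """
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(
            f"budget_tokens={budget_tokens} is below the minimum of {MIN_THINKING_BUDGET}"
        )
    if budget_tokens >= max_tokens:
        raise ValueError(
            f"budget_tokens={budget_tokens} must be less than max_tokens={max_tokens}"
        )
    return {"type": "enabled", "budget_tokens": budget_tokens}


thinking = build_thinking_param(budget_tokens=5000, max_tokens=8000)
print(thinking)
```

Failing locally with a ValueError is cheaper than a round trip that ends in a 400 response.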

Cost

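Thinking tokens are billed at the output rate, not at a separate multiplier. Using illustrative rates of $3/MTok for input and $15/MTok for output (confirm the current rates for your model on the pricing page), a single request with 2000 input tokens, 5000 thinking tokens, and 500 visible output tokens costs (2000*$3 + (5000 + 500)*$15) / 1M = $0.0885. The same request without thinking costs (2000*$3 + 500*$15) / 1M = $0.0135, so thinking makes it about 6.5x more expensive. Budget accordingly: the multiplier grows with the size of the thinking budget relative to the rest of the request.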
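The arithmetic can be wrapped in a small estimator. This is a sketch under the assumption that thinking tokens are billed as part of output_tokens; estimate_request_cost is a hypothetical helper, and the $3 and $15 per-MTok rates are placeholders rather than quoted prices.

```python
def estimate_request_cost(input_tokens, output_tokens,
                          input_rate_per_mtok, output_rate_per_mtok):
    """Estimate the dollar cost of one request.

    output_tokens is expected to already include thinking tokens,
    as reported in usage.output_tokens.
    """
    input_cost = input_tokens * input_rate_per_mtok / 1_000_000
    output_cost = output_tokens * output_rate_per_mtok / 1_000_000
    return input_cost + output_cost


# Placeholder rates ($3 input, $15 output per MTok); check current pricing.
with_thinking = estimate_request_cost(2000, 5000 + 500, 3.0, 15.0)
without_thinking = estimate_request_cost(2000, 500, 3.0, 15.0)
print(f"with thinking:    ${with_thinking:.4f}")
print(f"without thinking: ${without_thinking:.4f}")
```

Passing the rates as arguments keeps the estimator usable when prices or models change.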

Rate limits

Extended thinking requests consume quota faster because thinking adds output tokens to every response. If you hit rate limits, check your tokens-per-minute limits as well as your requests-per-minute limit. Thinking-enabled requests may hit a token limit sooner even if fewer requests are sent.

Common gotcha

Developers sometimes assume cache_creation_input_tokens reports thinking tokens: it actually reports tokens written to the prompt cache. Thinking tokens are part of output_tokens and are billed at the output rate, with no extra multiplier. A request with 1000 regular input tokens, 5000 thinking tokens, and 500 visible output tokens is billed as 1000 input tokens plus 5500 output tokens. Also note that budget_tokens is a cap, not a charge: you pay for the thinking tokens actually generated.

Error recovery

BadRequestError (budget_tokens below minimum)

Increase budget_tokens to at least 1024 and keep it below max_tokens. If you want less reasoning, consider removing extended thinking entirely and calling messages.create() without the thinking parameter.

OverloadedError

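This error means the API is temporarily overloaded; it is not a problem with your request. Retry with exponential backoff, and space out requests with longer delays between them. Thinking-enabled requests run longer, so lowering budget_tokens also shortens each attempt.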

BadRequestError (invalid_request_error: thinking parameter without budget_tokens)

Add the budget_tokens field: {"type": "enabled", "budget_tokens": 2048}. Every request that sets "type": "enabled" must specify a budget.
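For transient failures such as overload, a retry loop with exponential backoff is a common pattern. The sketch below is generic and uses only the standard library: call_with_backoff, make_request, and is_retryable are hypothetical names, and the stand-in failure is a plain RuntimeError so the example runs without the SDK.

```python
import random
import time


def call_with_backoff(make_request, is_retryable, max_attempts=4,
                      base_delay=1.0, sleep=time.sleep):
    """Call make_request(), retrying with exponential backoff and jitter.

    make_request and is_retryable are hypothetical callables supplied by the caller.
    """
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Delay doubles each attempt (base, 2x, 4x), plus up to 25% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))


# Stand-in request that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("overloaded")
    return "ok"


recorded_delays = []
result = call_with_backoff(
    flaky_request,
    is_retryable=lambda exc: isinstance(exc, RuntimeError),
    sleep=recorded_delays.append,  # record delays instead of sleeping
)
print(result, len(recorded_delays))
```

The anthropic client also accepts a max_retries option for built-in retries, so a custom loop like this is mainly useful when you need control over delays or logging.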

Experienced dev note

Senior developers often enable thinking globally to 'improve quality' without measuring ROI. Track the cost of thinking-enabled requests separately in your logging: add a cost_cents field to request logs, along with a flag recording whether thinking was enabled. Because thinking tokens are included in output_tokens, the output side is output_cost_cents = usage.output_tokens * output_rate_per_mtok / 1_000_000 * 100. Comparing the same use case with and without thinking reveals where the extra output tokens pay off. You may find that thinking improves only a minority of queries while accounting for a large share of your budget. Use per-request feature flags or conditional logic based on query complexity to enable thinking only when it matters.
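Once request logs carry a cost and a thinking flag, a small aggregation shows where the spend goes. This is a sketch only: spend_by_use_case and the record keys (use_case, thinking_enabled, cost_cents) are hypothetical, and the sample records are invented for illustration.

```python
from collections import defaultdict


def spend_by_use_case(records):
    """Summarize cost per use case, split by whether thinking was enabled.

    Each record is a dict with hypothetical keys:
    use_case (str), thinking_enabled (bool), cost_cents (float).
    """
    totals = defaultdict(lambda: {"thinking": 0.0, "standard": 0.0})
    for record in records:
        bucket = "thinking" if record["thinking_enabled"] else "standard"
        totals[record["use_case"]][bucket] += record["cost_cents"]
    return dict(totals)


# Invented sample records; real ones would come from your request logs.
sample_records = [
    {"use_case": "code_review", "thinking_enabled": True, "cost_cents": 8.85},
    {"use_case": "code_review", "thinking_enabled": False, "cost_cents": 1.35},
    {"use_case": "faq", "thinking_enabled": True, "cost_cents": 8.85},
]
summary = spend_by_use_case(sample_records)
for use_case, totals in summary.items():
    print(use_case, totals)
```

Pairing these totals with a quality metric per use case is what turns the numbers into a decision about where thinking stays enabled.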

Check your understanding

If a user makes two requests that share the same long prompt with prompt caching enabled, the second request shows cache_read_input_tokens > 0. How does this interact with thinking token costs?

Show answer hint

Cache reads apply to cached prompt (input) tokens, which are billed at a 90% discount (0.1x the input rate). Thinking tokens generated for the second response are new output tokens and are billed at the output rate. Caching lowers the input side of the bill, not the thinking side.

VERSION Extended thinking (with thinking parameter) was introduced in anthropic 0.87.x and is stable in 0.94.x. The field cache_creation_input_tokens belongs to prompt caching and does not track thinking tokens; thinking tokens are reported in output_tokens.
