API Advanced hard · 8 min

Precise token counting with the API

What you will learn

Use the <code>count_tokens()</code> method to measure exact token usage before sending requests, enabling precise cost prediction and context window management.

Why this matters

Token counting is essential for production systems because Claude's pricing is token-based, context windows are finite, and incorrect estimates lead to silent truncation or unexpected costs. Counting tokens before sending a request lets you validate that your prompt fits the model's context window and predict exactly what a request will cost.

Skip if: Don't use token counting when you're prototyping with small, simple prompts where you're confident the text fits (< 100 tokens). For production systems handling user-generated content, variable-length documents, or cost-sensitive operations, always count first.

Explanation

The count_tokens() method measures the exact number of tokens Claude will consume when processing your messages. Unlike estimation formulas (roughly 1 token per 4 characters), this API call uses Anthropic's actual tokenizer, accounting for message formatting, system prompts, tool definitions, and other factors that inflate token count beyond raw text length.

Under the hood, the API accepts the same message structure you'd send to messages.create(): including system prompts, images, tool definitions, and conversation history: and returns the precise token count without actually generating a response. This is a zero-cost metadata operation (no tokens charged to your quota) that runs synchronously, making it safe to call before every request.

Use token counting to: validate that multi-turn conversations won't exceed context limits before continuing a chat, estimate costs before processing large batches of documents, enforce maximum input sizes to prevent runaway costs from adversarial prompts, and debug why responses seem truncated (you'll discover the input was larger than expected).

Request code

python

import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ.get('ANTHROPIC_API_KEY'))

messages = [
    {
        'role': 'user',
        'content': 'Write a detailed essay on quantum computing and its applications in cryptography. Include historical context, current state-of-the-art techniques, and future implications.'
    }
]

system_prompt = 'You are an expert in quantum physics and computer science. Provide thorough, technically accurate responses with citations where appropriate.'

token_response = client.messages.count_tokens(
    model='claude-opus-4-6',
    system=system_prompt,
    messages=messages
)

print(f'Input tokens: {token_response.input_tokens}')
print(f'Cache creation tokens: {token_response.cache_creation_input_tokens}')
print(f'Cache read tokens: {token_response.cache_read_input_tokens}')

cost_per_input = 0.015 / 1_000_000
estimated_cost = token_response.input_tokens * cost_per_input
print(f'Estimated input cost: ${estimated_cost:.6f}')

Authentication

Ensure your ANTHROPIC_API_KEY environment variable is set before instantiating the Anthropic client. The SDK reads this key at client initialization time, not at method call time. Set it before importing or instantiating: os.environ['ANTHROPIC_API_KEY'] = 'sk-ant-...' or export it in your shell before running Python.

Response shape

Field	Description
`input_tokens`	integer: total tokens in the system prompt + all messages
`cache_creation_input_tokens`	integer: tokens that would be cached if prompt caching is enabled (subset of input_tokens)
`cache_read_input_tokens`	integer: tokens read from cache on this call (0 if first call or no cache hit)

Field guide

input_tokens

The primary field: this is what you'll be charged for when you actually call messages.create() with identical inputs. Use this for cost prediction.

cache_creation_input_tokens

Often overlooked: if you're using prompt caching, this tells you how many of your input tokens will be written to the cache on the next request. Cached tokens cost 90% less on reads, so large cache_creation values = future savings.

cache_read_input_tokens

A hidden optimization flag: if this is > 0, it means previous requests cached parts of your prompt, and this request benefited from that cache. Zero on first calls, but reveals cost savings in multi-turn conversations with stable system prompts.

Setup trap

Token counts are model-specific. If you count tokens with model='claude-opus-4-6' but send the request to model='claude-sonnet-4-6', your counts are invalid: these models have different tokenizers. The API won't error; you'll just have wrong predictions. Always use the same model string for both count_tokens() and messages.create().

Cost

As of April 2026, claude-opus-4-6 input is $0.015 per 1M tokens. A 100k-token request costs $1.50 in input alone. If you're processing batches of documents, calling count_tokens() first lets you filter or truncate documents that would exceed your budget per-request. The count_tokens() call itself is free (doesn't deduct from quota), so the ROI is immediate.

Rate limits

Token counting requests share the same rate limit bucket as message requests (currently 10k requests per minute for most accounts). If you're counting tokens for every request in a batch pipeline, you might hit rate limits before your message quota. Consider batching: count tokens for 10 prompts in parallel, then send the cheap ones and skip the expensive ones, rather than 1-by-1.

Common gotcha

Developers count tokens for the user message alone, forgetting to include the system prompt in the count. The system prompt is prepended to every message and consumes tokens: omitting it from count_tokens() underestimates your true token usage by 5-15%. Always pass system= to count_tokens() if you use a system prompt in messages.create().

Error recovery

AuthenticationError

ANTHROPIC_API_KEY is missing, invalid, or expired. Verify with: curl -H 'x-api-key: YOUR_KEY' https://api.anthropic.com/v1/messages/count_tokens (should 400 with invalid model, not 401). If it 401s, your key is wrong.

InvalidRequestError with 'model' in message

You passed an unsupported model string (e.g., 'claude-3-sonnet' which is retired). Use 'claude-opus-4-6' or 'claude-sonnet-4-6'. Check your string for typos.

BadRequestError: 'messages' must be a list

Passed messages as a dict or string instead of a list of message objects. Each message must be: {'role': 'user'|'assistant', 'content': 'text_or_list'}. If you're building messages dynamically, wrap in a list.

APIError: invalid_request_error with 'system' in message

The system parameter expects a string, not a list or dict. Even if you use a list of content blocks in messages.create(), system must be a single string in count_tokens().

Experienced dev note

Token counting is your early-warning system for two hidden costs: (1) Cache thrashing: if cache_creation_input_tokens is huge but you're not actually enabling prompt caching, you're leaving 90% cost savings on the table. Check if your system prompt + context is stable enough to cache. (2) Context bloat in retrieval pipelines: when you embed a 20-document context into every request, token counts explode. Count before each request in production, then log the distribution. If p95 token count is 80% of your context window, your system is fragile. Trim earlier.

Check your understanding

You're building a chatbot that processes user messages with a 5KB system prompt and stores the last 10 messages in a conversation. You call count_tokens() on the first user message and get 2,500 tokens. On the fifth turn of the conversation, what should you expect count_tokens() to return, and why? Would prompt caching reduce this number?

Show answer hint

The token count will be significantly higher on turn 5 because you're including the entire conversation history (previous 4 user messages + 4 assistant responses) plus the system prompt. Prompt caching would help only if the system prompt and earlier messages were identical across different users or conversations: but within a single conversation, caching doesn't help because the cached prompt keeps growing with new messages.

VERSION anthropic SDK 0.94.x uses messages.count_tokens() (part of the main client). In 0.3.x-0.93.x, this was available as client.messages.count_tokens(). The API response shape is stable. As of April 2026, cache_creation_input_tokens and cache_read_input_tokens were added in SDK 0.90.x to support prompt caching; upgrade if you're using an older version and need cache metrics.

Community Notes

No notes yetBe the first to share a version-specific fix or tip.