Sonnet for generation tasks
Why this matters
Sonnet balances cost and speed for production workloads: choosing the right model prevents overspending on Opus for tasks that don't need its reasoning depth, while avoiding quality loss from using a smaller model.
Explanation
What it does: Claude Sonnet 4.6 is Anthropic's mid-tier model: faster than Opus with 2x cost efficiency: designed for production generation tasks like email drafting, code generation, summarization, and data transformation. How it works: Sonnet uses the same transformer architecture as Opus but with reduced model capacity and optimized inference kernels, trading a small amount of reasoning capability for 3-5x lower latency and 50% cost reduction per token. The model maintains strong instruction-following and structured output capability through alignment training. When to use it: Choose Sonnet when you need fast, cost-effective generation at scale: batch processing documents, generating API responses, or creating variants of content. Use it when your task requires good writing quality and instruction adherence but not novel problem-solving.
Request code
from anthropic import Anthropic
client = Anthropic()
message = client.messages.create(
model='claude-sonnet-4-6',
max_tokens=1024,
messages=[
{
'role': 'user',
'content': 'Write a professional email requesting a meeting with a client named Alex about Q2 budget planning. Keep it to 3 sentences.'
}
]
)
print('Generated email:')
print(message.content[0].text)
print(f'\nTokens used - Input: {message.usage.input_tokens}, Output: {message.usage.output_tokens}') Authentication
Set your Anthropic API key as an environment variable before running code: export ANTHROPIC_API_KEY='sk-ant-...'. The SDK reads this automatically on client instantiation.
Response shape
| Field | Description |
|---|---|
id | msg_xxxxx (unique message identifier) |
type | message (always 'message') |
role | assistant |
content | [{"type": "text", "text": "Generated response here"}] |
model | claude-sonnet-4-6 |
stop_reason | end_turn (normal completion) or max_tokens (hit limit) |
stop_sequence | null or custom stop sequence if provided |
usage | [object Object] |
Field guide
stop_reason If 'max_tokens', your response was truncated: you need more tokens. If 'end_turn', generation completed naturally.
usage.cache_creation_input_tokens Tokens written to prompt cache on this request: only non-zero if you enabled prompt caching, saves money on repetitive requests.
usage.cache_read_input_tokens Tokens read from cache on this request: if non-zero, those tokens cost 90% less than regular input tokens.
Setup trap
The SDK reads ANTHROPIC_API_KEY at client instantiation time. If you set the environment variable after creating the Anthropic() client, the key won't be picked up: initialize the client after your environment is fully configured.
Cost
Sonnet costs $3/MTok input, $15/MTok output. For a 1K-token generation task at scale, expect ~$0.003-0.015 per request. Opus costs 3x more ($15/$75). Over 10K daily requests, switching from Opus to Sonnet saves ~$360/day.
Rate limits
Standard tier: 10K requests/min, 1M tokens/min. Sonnet's 3-5x lower latency means you'll hit request rate limits (not token limits) first. Implement exponential backoff on 429 responses; upgrade to higher tier if consistent batching exceeds 10K req/min.
Common gotcha
Developers often set max_tokens=2048 assuming Sonnet can handle it, but for generation tasks at scale, you'll hit rate limits (10K requests/min on standard tier). The real gotcha: not checking stop_reason == 'max_tokens' in production: this silently truncates responses in batch jobs without error.
Error recovery
RateLimitErrorInvalidRequestErrorAPIConnectionErrorAuthenticationErrorExperienced dev note
Cache your system prompts and few-shot examples in Sonnet calls. Prompt caching (via request headers) stores up to 4x 1M-token blocks per model, and cached tokens cost 90% less. For repetitive generation (email templates, code scaffolding), a single cached few-shot example pays for itself in 15-20 requests. Also: Sonnet's speed advantage over Opus appears in latency metrics (200-400ms vs 800ms-2s), not token throughput: use for user-facing APIs, not batch processing where you'd parallelize anyway.
Check your understanding
You're generating 5,000 customer support responses per day. Your current Opus-based system costs $450/day and takes 1.2 seconds per response. You switch to Sonnet at $0.15/response with 350ms latency. How much do you save daily, and why would the latency improvement matter more than the cost savings?
Show answer hint
Cost savings: ~$300/day ($450 - $150). Latency matters because 350ms allows real-time API responses to users (under 500ms perceived latency threshold); 1.2s forces you to queue requests or show loading spinners, degrading UX even though both are 'production-ready'.