Haiku for classification and routing tasks
Why this matters
Production systems often need to route requests to different downstream handlers or classify user input before expensive operations. Haiku processes these decisions 3-5x faster and costs 90% less than Opus, making it ideal for the classification layer in multi-model pipelines.
Explanation
What Haiku does: Claude Haiku is Anthropic's fastest model, suited to short, token-efficient tasks like categorization, routing, and structured extraction. It uses the same Messages API as Sonnet and Opus, can return short classifications in roughly 100ms under favorable conditions, and costs $0.80 per million input tokens (vs $3 for Sonnet).
How it works: Haiku is the smallest model in its Claude generation. Anthropic does not publish its architecture, but the practical trade is clear: when you specify model="claude-3-5-haiku-20241022", you get lower latency and lower cost in exchange for some reasoning depth. That trade suits classification, routing, and structured output generation, where the decision space is bounded.
When to use: Deploy Haiku in request-filtering layers (spam detection, content moderation), multi-model routers ("route this to search vs summarization"), or classification pipelines processing 1000+ requests per minute. Pair Haiku with Sonnet/Opus using a two-stage pattern: Haiku classifies or routes, then the full model handles complex cases.
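The router idea above can be sketched as a label-to-handler dispatch table. This is a minimal illustration, not a prescribed design: the handler functions, the ROUTES table, and the stub classifier are hypothetical names, and in a real system the classify callable would wrap a Haiku call.

```python
from typing import Callable, Dict


# Hypothetical downstream handlers; real ones would call search or a larger model.
def handle_search(text: str) -> str:
    return f"search:{text}"


def handle_summarization(text: str) -> str:
    return f"summarize:{text}"


def handle_fallback(text: str) -> str:
    return f"fallback:{text}"


# Maps the label returned by the classifier to the handler that serves it.
ROUTES: Dict[str, Callable[[str], str]] = {
    "search": handle_search,
    "summarization": handle_summarization,
}


def route(text: str, classify: Callable[[str], str]) -> str:
    """Classify the request, then dispatch to the matching handler."""
    label = classify(text).strip().lower()
    handler = ROUTES.get(label, handle_fallback)  # unknown labels go to fallback
    return handler(text)


# Stub classifier standing in for a Haiku call (an assumption of this sketch).
def stub_classify(text: str) -> str:
    return "search" if "find" in text.lower() else "summarization"


print(route("Find papers on routing", stub_classify))
```

Normalizing the label and falling back on unknown values keeps the router safe when the classifier returns something outside the expected set.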
Request code

    import json

    import anthropic

    # The client reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()

    # Spell out the JSON keys so the reply matches what the parser expects.
    SYSTEM_PROMPT = (
        "You are a support ticket classifier. Respond with only a valid JSON "
        "object (no markdown) of the form "
        '{"category": "<category>", "confidence": <number from 0.0 to 1.0>}. '
        "Categories: billing, technical, product_feedback, account."
    )


    def classify_support_ticket(ticket_text: str) -> dict:
        """Route a support ticket to a category using Haiku."""
        message = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=100,
            system=SYSTEM_PROMPT,
            messages=[
                {
                    "role": "user",
                    "content": f"Classify this ticket:\n\n{ticket_text}",
                }
            ],
        )
        response_text = message.content[0].text

        # Fall back to "unknown" if the reply is not a JSON object.
        try:
            classification = json.loads(response_text)
        except json.JSONDecodeError:
            classification = None
        if not isinstance(classification, dict):
            classification = {"category": "unknown", "confidence": 0.0}

        return {
            "category": classification.get("category"),
            "confidence": classification.get("confidence", 0.5),
            "input_tokens": message.usage.input_tokens,
            "output_tokens": message.usage.output_tokens,
        }


    ticket = "I was charged twice this month. My invoice shows two $99 charges on the same day."
    result = classify_support_ticket(ticket)
    print(json.dumps(result, indent=2))

Authentication
Set your Anthropic API key as an environment variable before running any code: export ANTHROPIC_API_KEY="sk-ant-...". The Anthropic SDK reads this automatically when instantiating the client. No manual header construction needed.
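A small pre-flight check can catch a missing key before the first request fails. The sketch below is an illustration only: api_key_configured is a hypothetical helper, not part of the Anthropic SDK, and it checks only that the variable is non-empty, not that the key is valid.

```python
import os
from typing import Mapping, Optional


def api_key_configured(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if ANTHROPIC_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())


if not api_key_configured():
    print("ANTHROPIC_API_KEY is not set; export it before creating the client.")
```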
Response shape
| Field | Description |
|---|---|
| category | string: one of billing, technical, product_feedback, account |
| confidence | float: model's confidence in classification (0.0-1.0) |
| input_tokens | integer: tokens consumed by the prompt |
| output_tokens | integer: tokens in the model's response |
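Because the category and confidence values come from model output, it can help to check the returned dict against the shape in the table before acting on it. The following is a sketch under that assumption; validate_result and ALLOWED_CATEGORIES are hypothetical names introduced here.

```python
from typing import List

ALLOWED_CATEGORIES = {"billing", "technical", "product_feedback", "account"}


def validate_result(result: dict) -> List[str]:
    """Return a list of problems; an empty list means the shape is valid."""
    problems = []

    category = result.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append("category is not one of the allowed values")

    confidence = result.get("confidence")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence is not a number")
    elif not 0.0 <= confidence <= 1.0:
        problems.append("confidence is outside 0.0-1.0")

    for field in ("input_tokens", "output_tokens"):
        value = result.get(field)
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append(f"{field} is not a non-negative integer")

    return problems


good = {"category": "billing", "confidence": 0.9, "input_tokens": 50, "output_tokens": 12}
print(validate_result(good))
```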
Field guide
content[0].text: The raw text of Haiku's reply. Parse it to extract the structured classification.
usage.input_tokens: Track this carefully. Input cost scales with prompt length, so a 5K-token prompt costs five times as much per request as a 1K-token one and erodes the savings. Keep prompts lean or batch similar classifications together.
usage.output_tokens: Set max_tokens to 100-200 for classification. The cap bounds worst-case output cost and latency if the model adds text beyond the JSON.
Setup trap
Many developers test Haiku with verbose prompts (1-2K tokens of examples) and conclude it's slow. Haiku's speed advantage shows most clearly with lean prompts (<500 tokens). Every token in the system prompt adds processing overhead, so write minimal instructions: "Classify as X, Y, or Z. Respond with JSON." Test with short prompts like these to see latency closer to the 100-150ms range.
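To compare lean and verbose prompts fairly, time several calls and look at the median rather than a single run. The helper below is a sketch: measure_latency_ms is a hypothetical name, and fake_request is a stand-in that sleeps instead of calling the API, so the numbers it prints say nothing about Haiku itself.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(call: Callable[[], object], runs: int = 5) -> List[float]:
    """Time each invocation of `call` and return the samples in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Stand-in for a real request (assumption); swap in a function that calls the API.
def fake_request() -> None:
    time.sleep(0.01)


samples = measure_latency_ms(fake_request, runs=3)
print(f"median: {statistics.median(samples):.1f} ms")
```

Passing the request as a callable lets the same harness time a short prompt and a long prompt under identical conditions.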
Cost
At scale, Haiku saves dramatically. If each request averages 1,000 input tokens, 1M classification requests consume 1B input tokens: $800 on Haiku at $0.80 per 1M tokens, versus $3,000 on Sonnet at $3 per 1M. That is $2,200 saved per million tickets, counting input tokens only, and it compounds fast: a system routing 100K tickets/day (about 3M per month) saves roughly $6,600 per month by using Haiku.
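The arithmetic can be checked with a few lines of code. This sketch uses the input prices quoted in this section ($0.80 and $3 per million tokens) and assumes 1,000 input tokens per request; it ignores output tokens, and input_cost_usd is a hypothetical helper name.

```python
HAIKU_INPUT_PER_MTOK = 0.80   # USD per million input tokens, as quoted in this section
SONNET_INPUT_PER_MTOK = 3.00  # USD per million input tokens, as quoted in this section


def input_cost_usd(requests: int, tokens_per_request: int, price_per_mtok: float) -> float:
    """Input-token cost in USD for a number of equally sized requests."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1_000_000 * price_per_mtok


# One million tickets at an assumed 1,000 input tokens each.
haiku = input_cost_usd(1_000_000, 1_000, HAIKU_INPUT_PER_MTOK)
sonnet = input_cost_usd(1_000_000, 1_000, SONNET_INPUT_PER_MTOK)
print(f"Haiku: ${haiku:,.2f}  Sonnet: ${sonnet:,.2f}  saved: ${sonnet - haiku:,.2f}")

# 100K tickets/day over a 30-day month.
monthly_requests = 100_000 * 30
monthly_saving = (
    input_cost_usd(monthly_requests, 1_000, SONNET_INPUT_PER_MTOK)
    - input_cost_usd(monthly_requests, 1_000, HAIKU_INPUT_PER_MTOK)
)
print(f"Monthly saving: ${monthly_saving:,.2f}")
```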
Rate limits
Rate limits are set per usage tier and measured in requests per minute and tokens per minute, so check the limits for your account in the Anthropic Console rather than relying on fixed figures. Because classification prompts are short, high-volume workloads tend to hit request-count limits before token limits. If you do, pack 20-100 tickets into each API call and ask for one classification per ticket, raising max_tokens to fit the longer reply.
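Packing several tickets into one call needs a prompt that numbers the tickets and a parser that maps the reply back to them. The sketch below shows one possible shape; build_batch_prompt, parse_batch_reply, and the id/category reply format are assumptions made for illustration, not an API feature.

```python
import json
from typing import Dict, List


def build_batch_prompt(tickets: List[str]) -> str:
    """Number each ticket so the reply can be matched back by id."""
    lines = [
        "Classify each ticket. Respond with only a JSON array of objects "
        'like {"id": 1, "category": "billing"}.',
        "",
    ]
    for i, ticket in enumerate(tickets, start=1):
        lines.append(f"[{i}] {ticket}")
    return "\n".join(lines)


def parse_batch_reply(reply_text: str, expected: int) -> Dict[int, str]:
    """Map ticket id -> category; ids missing from the reply stay 'unknown'."""
    results = {i: "unknown" for i in range(1, expected + 1)}
    try:
        items = json.loads(reply_text)
    except json.JSONDecodeError:
        return results
    if not isinstance(items, list):
        return results
    for item in items:
        if not isinstance(item, dict):
            continue
        ticket_id = item.get("id")
        if isinstance(ticket_id, int) and ticket_id in results:
            results[ticket_id] = str(item.get("category", "unknown"))
    return results


prompt = build_batch_prompt(["Charged twice this month", "App crashes on login"])
print(prompt)
```

Defaulting every id to "unknown" means a truncated or partial reply degrades to missing labels instead of mislabeled tickets.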
Common gotcha
Haiku sometimes outputs filler text before JSON ("Here's the classification: {json}") when you ask for structured output. The code above catches the resulting JSONDecodeError, but it then returns category "unknown" and the classification is lost. The production fix is a stricter system prompt: "Respond with only a valid JSON object, no other text." Keep max_tokens around 150 as well, to bound the cost of any stray text.
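As a defensive complement to the prompt, the parser can pull the first JSON object out of a reply that has surrounding text. This is a sketch using the standard library's json.JSONDecoder.raw_decode; extract_json_object is a hypothetical helper name.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None


reply = 'Here\'s the classification: {"category": "billing", "confidence": 0.92}'
print(extract_json_object(reply))
```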
Error recovery
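A minimal sketch of one recovery policy: retry rate-limited calls with exponential backoff, and let everything else (such as an authentication failure) surface immediately, since retrying cannot fix a bad key. The helper call_with_retry and the FakeRateLimit exception are hypothetical; with the Anthropic SDK you would pass its rate-limit exception class as the retryable type.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on retryable errors, doubling the delay each time.

    Exceptions not listed in `retryable` propagate on the first occurrence.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Stand-in for a 429 error (assumption of this sketch).
class FakeRateLimit(Exception):
    pass


attempts = {"n": 0}


def flaky() -> str:
    """Fail twice with a rate-limit error, then succeed."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FakeRateLimit()
    return "billing"


delays = []  # record delays instead of sleeping, to keep the demo fast
result = call_with_retry(flaky, (FakeRateLimit,), sleep=delays.append)
print(result, delays)
```

Injecting the sleep function keeps the backoff logic testable without waiting in real time.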
json.JSONDecodeError
RateLimitError (429)
AuthenticationError (401)
Experienced dev note
The real win isn't Haiku's speed in isolation: it's the two-tier architecture. Send every request to Haiku first (100ms, $0.80/1M tokens) and escalate the roughly 5% it flags as uncertain to Sonnet. For a batch of 10K requests at 1,000 input tokens each, that is about $8 on Haiku plus $1.50 on Sonnet, vs $30 if you sent everything to Sonnet. Measure uncertainty via the response format: if Haiku returns a low value such as 'confidence: 0.65', route to Sonnet. The confidence arrives in the same response, so the check adds no extra API call, but it is self-reported rather than calibrated, so validate your threshold against labeled tickets.
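The escalation rule can be sketched as a function that takes two classifiers and a threshold. Everything named here is an assumption for illustration: classify_two_tier, the 0.7 default threshold, and the stub classifiers, which stand in for Haiku and Sonnet calls.

```python
from typing import Callable, Dict

Classifier = Callable[[str], Dict[str, object]]


def classify_two_tier(
    text: str,
    fast: Classifier,
    strong: Classifier,
    threshold: float = 0.7,
) -> Dict[str, object]:
    """Use the fast model's answer unless its confidence is below threshold."""
    first = fast(text)
    confidence = first.get("confidence")
    if isinstance(confidence, (int, float)) and confidence >= threshold:
        return {**first, "tier": "fast"}
    # Missing, malformed, or low confidence: escalate to the stronger model.
    return {**strong(text), "tier": "strong"}


# Stubs standing in for real model calls (assumptions of this sketch).
def stub_fast(text: str) -> Dict[str, object]:
    confident = "invoice" in text.lower()
    return {"category": "billing", "confidence": 0.95 if confident else 0.65}


def stub_strong(text: str) -> Dict[str, object]:
    return {"category": "account", "confidence": 0.9}


print(classify_two_tier("My invoice shows two charges", stub_fast, stub_strong))
print(classify_two_tier("Something odd happened", stub_fast, stub_strong))
```

Treating a missing or non-numeric confidence as uncertain means a malformed fast-tier reply escalates instead of being trusted.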
Check your understanding
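Why set max_tokens=100 for Haiku classification? What happens if you set max_tokens=2000 and ask for JSON output?
Answer hint: max_tokens is a hard cap, not a target. The model stops when it finishes its answer or hits the cap, whichever comes first, so with a strict JSON-only prompt max_tokens=2000 usually yields the same short reply. The lower cap matters when the model does add verbose reasoning or repeats the JSON: it bounds output cost and latency, which protects the savings from using Haiku. Set it too low, though, and the JSON is truncated and fails to parse.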