Prefill for JSON structure enforcement
Why this matters
When you need structured output from Claude (database records, API responses, parsed documents), prefilling the assistant's response makes the output start in your format from the first token. That removes preambles such as "Here is the JSON:" and cuts down on retries, though you should still validate the parsed result.
Explanation
What prefill does: The messages parameter accepts a list where you can pre-populate the assistant's response prefix. Instead of Claude generating from a blank slate, you provide the opening tokens of its response: typically an opening brace { or [. Claude then continues from that point, completing the JSON structure you started. This is different from prompting; it's literally inserting tokens into the response before generation.
How it works: When the last message in the list has role="assistant" and contains partial content, Claude treats that content as the already-written start of its reply and generates the continuation. The prefilled text is part of your request, so it counts as input tokens and not against max_tokens. The API returns only the continuation; the prefix is not echoed back, so you prepend it yourself.
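As a minimal sketch of the message layout described above, the helper below builds a `messages` list whose final entry is the assistant prefill. `build_prefilled_messages` is a hypothetical name chosen for illustration, not part of the Anthropic SDK; the list it returns is what you would pass as `messages=` to `client.messages.create`.

```python
def build_prefilled_messages(user_text: str, prefill: str = "{") -> list[dict]:
    """Return a messages list whose last entry is the assistant prefill.

    Hypothetical helper for illustration; not an SDK function.
    """
    return [
        {"role": "user", "content": user_text},
        # The trailing assistant turn is the prefix Claude continues from.
        {"role": "assistant", "content": prefill},
    ]


messages = build_prefilled_messages("Extract the name and age: Ana is 29.")
print(messages[-1])
```

Keeping the prefill in one variable or default argument makes it easy to prepend the same string to the returned text later.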
When to use it: Use prefill when you have a known JSON schema and want to reduce preambles, stray prose, and malformed structures. A bare { only fixes the first token; a longer prefix such as {"name": also pins the first key. It's particularly useful for extraction tasks ("extract these 5 fields"), API response mocking, or batch processing where consistency matters. Avoid prefill for open-ended generation or when the output structure is genuinely unknown.
Request code
```python
import json

import anthropic

# Reads ANTHROPIC_API_KEY from the environment when the client is created.
client = anthropic.Anthropic()

PREFILL = "{"  # opening token(s) of the assistant's reply

user_message = (
    "Extract the name, age, and occupation from this text: "
    "John Smith is a 34-year-old software engineer living in Portland."
)

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=256,
    messages=[
        {"role": "user", "content": user_message},
        # The trailing assistant message is the prefill; Claude continues from it.
        {"role": "assistant", "content": PREFILL},
    ],
)

# The API returns only the continuation, so prepend the prefill before parsing.
full_response = PREFILL
for block in response.content:
    if hasattr(block, "text"):
        full_response += block.text

parsed = json.loads(full_response)
print(f"Name: {parsed.get('name')}")
print(f"Age: {parsed.get('age')}")
print(f"Occupation: {parsed.get('occupation')}")
```

Authentication
Set your Anthropic API key before instantiation:

```bash
export ANTHROPIC_API_KEY='sk-ant-...'
```

Or pass it explicitly in code:

```python
client = anthropic.Anthropic(api_key='sk-ant-...')
```

The client reads the environment variable at instantiation time, not at request time.
Response shape
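| Field | Value |
|---|---|
| id | msg_unique_identifier |
| type | message |
| role | assistant |
| content | list of content blocks, e.g. `[{"type": "text", "text": "..."}]` |
| model | claude-opus-4-6 |
| stop_reason | `end_turn`, `max_tokens`, or `stop_sequence` |
| stop_sequence | the matched stop string, or null if none was hit |
| usage | object with `input_tokens` and `output_tokens` |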
Field guide
`content[0].text`: The text Claude generated after your prefilled prefix. It does not include the prefix, so always concatenate your prefill string with this value to get the complete output.
`stop_reason`: Critical field. 'end_turn' means Claude finished naturally. 'max_tokens' means output was cut off. 'stop_sequence' means a stop string was hit. If stop_reason is 'max_tokens' and you're parsing JSON, the result is almost certainly truncated and invalid: increase max_tokens and retry.
`usage.output_tokens`: Counts only Claude's generated tokens, not the prefilled portion. The prefill is part of the request, so it is counted in `usage.input_tokens` and billed at the input rate.
Setup trap
The prefill must be the last entry in messages, with role set to 'assistant'. Content is normally a plain string; a list of text content blocks is also accepted, but a string is simpler. Do not put the prefill in the system prompt: text there is treated as instructions, not as the start of the reply. The prefill must not end with trailing whitespace, or the API rejects the request, and prefill cannot be combined with extended thinking. The most common error is placing the prefill in the wrong message slot or using the wrong role.
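A small pre-flight check can catch these placement mistakes before a request is sent. The sketch below is a hypothetical validator (`find_prefill_problems` is not an SDK function) that only inspects string content. It flags a prefill that is not the final assistant turn, is empty, or ends with whitespace, which the API does not accept in a final assistant message.

```python
def find_prefill_problems(messages: list[dict]) -> list[str]:
    """Return human-readable problems with the prefill turn (empty if none).

    Hypothetical helper; only string content is inspected in this sketch.
    """
    if not messages:
        return ["messages is empty"]
    last = messages[-1]
    problems = []
    if last.get("role") != "assistant":
        problems.append("last message must have role 'assistant'")
    content = last.get("content")
    if isinstance(content, str):
        if not content:
            problems.append("prefill is empty")
        elif content != content.rstrip():
            problems.append("prefill ends with whitespace")
    return problems


user_turn = {"role": "user", "content": "Extract the fields as JSON."}
ok = find_prefill_problems([user_turn, {"role": "assistant", "content": "{"}])
trailing = find_prefill_problems([user_turn, {"role": "assistant", "content": "{ "}])
misplaced = find_prefill_problems([{"role": "assistant", "content": "{"}, user_turn])
print(ok, trailing, misplaced)
```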
Cost
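Prefilled content is billed as ordinary input tokens, because it is part of the request you send. A one-character prefix such as { adds roughly one token per request, so its direct cost is negligible. Prompt caching is a separate, opt-in feature: you mark a block with cache_control, and the API caches the request prefix up to that point. Cache writes cost more than the base input rate (about 25% more for the default 5-minute cache) and cache reads cost about 10% of it. A prefix must also meet a model-specific minimum length before it is cached at all, so a short JSON prefix is never cached on its own. For high-volume extractions with identical schemas, the savings come from caching a long shared system prompt or schema description. A cache hit requires the cached prefix to be identical, not the entire request; a different user message placed after the cache breakpoint does not invalidate it.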
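Prompt caching is something you opt into by marking a content block with `cache_control`, and it pays off for a long shared prefix, not for a one-character prefill. The sketch below assembles request arguments that combine a cacheable system block with a prefilled assistant turn. `build_cached_request` and the sample schema text are assumptions for illustration, and the system text is only cached if it meets the model's minimum cacheable length.

```python
def build_cached_request(system_text: str, user_text: str, prefill: str = "{") -> dict:
    """Build keyword arguments for client.messages.create (hypothetical helper)."""
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 256,
        # The shared instructions are the cache candidate, not the prefill.
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {"role": "user", "content": user_text},   # varies per request
            {"role": "assistant", "content": prefill},  # prefill, plain input tokens
        ],
    }


request = build_cached_request(
    system_text="Return JSON with keys name, amount, and date.",  # sample text
    user_text="Invoice 17: ACME Corp, $120.00, 2024-03-01",
)
print(request["system"][0]["cache_control"])
```

With a real client this would be sent as `client.messages.create(**request)`; the `usage` object on the response reports how many input tokens were written to or read from the cache.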
Rate limits
Prefill doesn't bypass rate limits. It can trim output tokens slightly, because Claude skips preambles and code fences and starts directly inside the JSON, but the saving per request is small. If you're hitting token-per-minute limits, treat prefill as a minor optimization alongside a tighter max_tokens and shorter prompts, not as a fix on its own.
Common gotcha
The prefilled content is not automatically included in response.content[0].text: you receive only what Claude generated after your prefix. If you print response.content[0].text directly, you'll see `"name": "John Smith", "age": 34, ...}` without the opening brace. You must concatenate the prefix yourself: `full_json = '{' + response.content[0].text`. Developers often forget this and get parsing errors.
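One way to avoid forgetting the prefix is to route every response through a single function that reattaches it and checks `stop_reason` before parsing. This is a sketch under assumptions: `parse_prefilled_json` and `TruncatedOutputError` are hypothetical names, and the arguments stand in for values you would read from a real response.

```python
import json


class TruncatedOutputError(ValueError):
    """Raised when the model stopped because it hit max_tokens."""


def parse_prefilled_json(prefill: str, continuation: str, stop_reason: str):
    """Reattach the prefill to the returned text and parse the result."""
    if stop_reason == "max_tokens":
        raise TruncatedOutputError("output was cut off; increase max_tokens")
    # json.loads raises json.JSONDecodeError if the combined text is invalid.
    return json.loads(prefill + continuation)


continuation = '"name": "John Smith", "age": 34}'  # what the API would return
record = parse_prefilled_json("{", continuation, "end_turn")
print(record["name"])  # John Smith
```

Parsing `continuation` on its own raises `json.JSONDecodeError`, which is the failure the paragraph above describes.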
Error recovery
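- json.JSONDecodeError: the text you parsed is not valid JSON. Confirm that you prepended the prefill to the returned text, then check stop_reason for truncation.
- stop_reason of 'max_tokens': this is a field on a successful response, not a raised exception. The JSON was cut off; increase max_tokens and retry.
- anthropic.APIStatusError with status 401: the API key is missing or invalid. Check ANTHROPIC_API_KEY or the api_key argument.
- Empty or malformed content in response: check the response before parsing and retry the request instead of passing partial data downstream.
Experienced dev note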
Prefill is a cheap technique that many developers overlook. Instead of elaborate prompt engineering to 'force' JSON, a one-character prefill removes preambles and code fences and makes the first token predictable. It does not guarantee valid JSON, so keep a parse check and a bounded retry. The hidden value: when you chain multiple Claude calls (e.g., extract → transform → enrich), prefilling the structure of each downstream call keeps intermediate outputs consistent and reduces wasted tokens. Prefill is compatible with prompt caching, but the savings come from caching a long shared prefix (system prompt or schema instructions) marked with cache_control, where cache reads are billed at a fraction of the base input rate; the prefill itself is too short to matter. Use it in production APIs where latency and cost matter.
Check your understanding
You're building a batch document extractor that processes 1,000 invoices per hour. Each invoice should return name, amount, and date as JSON. You use prefill to guarantee the output format. What happens to your token costs and response latency if you call the same prefill structure (opening brace) 1,000 times in a single hour? What should you consider to optimize further?
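Answer hint
The prefill by itself does not trigger prompt caching: each of the 1,000 requests pays for the opening brace as about one input token, and output tokens stay roughly constant, so cost and latency barely change. Caching helps only if the requests share a long prefix, such as a system prompt with the schema, that you mark with cache_control and that is reused within the cache lifetime (5 minutes by default); the first request writes the cache and later ones read it at a reduced rate. To optimize further, and if you don't need results immediately, submit the invoices through the Message Batches API: one submission carries many requests, each with its own prefill, and they are processed asynchronously at a discounted rate.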
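A batch submission is a list of individual requests, each with its own `custom_id` and its own prefilled assistant turn. The sketch below builds that list for a handful of invoices. `build_batch_requests`, the invoice IDs, and the prompt wording are assumptions for illustration; with a real client the list would be passed to `client.messages.batches.create(requests=...)`.

```python
def build_batch_requests(invoices: dict[str, str], prefill: str = "{") -> list[dict]:
    """Build one batch entry per invoice (hypothetical helper)."""
    requests = []
    for invoice_id, text in invoices.items():
        requests.append(
            {
                "custom_id": invoice_id,  # used to match results to inputs
                "params": {
                    "model": "claude-opus-4-6",
                    "max_tokens": 256,
                    "messages": [
                        {
                            "role": "user",
                            "content": f"Extract name, amount, and date as JSON:\n{text}",
                        },
                        {"role": "assistant", "content": prefill},
                    ],
                },
            }
        )
    return requests


batch = build_batch_requests(
    {
        "invoice-001": "ACME Corp, $120.00, 2024-03-01",
        "invoice-002": "Globex, $89.50, 2024-03-02",
    }
)
print(len(batch), batch[0]["custom_id"])
```

Each batch result contains only the continuation, as with a single request, so the same prefill string still has to be prepended before parsing.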