API Intermediate medium · 6 min

Prefill for JSON structure enforcement

What you will learn

Use assistant-message prefill in the Messages API to steer Claude toward JSON output that matches a specific schema, starting from the first generated token.

Why this matters

When you need structured output from Claude (database records, API responses, parsed documents), prefilling the start of the assistant's response makes Claude begin in your format from the first token. This removes preambles such as "Here is the JSON:" and reduces malformed output. Prefill alone does not guarantee schema-valid JSON, so keep a parse check in place.

Skip if: You only need occasional structured extraction and can afford retries, or your data is already semi-structured (CSV, tables); start with plain prompting. If you need JSON checked against a schema with complex conditional fields, use a schema-based mechanism such as tool use with a JSON Schema instead (Claude has no "JSON mode" that is switched on through the system prompt). If you're building a simple chatbot without structured output requirements, prefill adds unnecessary complexity.
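When the output has to follow a schema with nested or conditional fields, one alternative to prefill is tool use with a forced tool choice. The sketch below only builds the request arguments and makes no API call. The tool name `record_person` and its schema are hypothetical, chosen to mirror the extraction example in this lesson.

```python
import json

# Hypothetical tool definition: the name and schema are illustrative only.
person_tool = {
    "name": "record_person",
    "description": "Record structured details about a person.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "occupation": {"type": "string"},
        },
        "required": ["name", "age", "occupation"],
    },
}

request = {
    "model": "claude-opus-4-6",  # model ID taken from this lesson
    "max_tokens": 256,
    "tools": [person_tool],
    # Force the model to answer by calling this specific tool.
    "tool_choice": {"type": "tool", "name": "record_person"},
    "messages": [
        {"role": "user", "content": "John Smith is a 34-year-old software engineer."}
    ],
}

# With the SDK installed, this would be sent as client.messages.create(**request).
# The structured data then arrives in a content block of type "tool_use".
print(json.dumps(request["tool_choice"]))
```

Note that there is no assistant prefill in this request: the schema, not a response prefix, carries the structure.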

Explanation

What prefill does: The messages parameter accepts a list of conversation turns, and the last turn may be an assistant message that holds the start of the response you want. Instead of Claude generating from a blank slate, you provide the opening tokens of its response: typically an opening brace { or bracket [. Claude then continues from that point, completing the JSON structure you started. This is different from prompting: you are placing tokens at the start of the response before generation begins.

How it works: When the messages list ends with a role="assistant" message that contains partial content, Claude treats that content as output it has already started and generates only the continuation. The prefill is part of the request, so its tokens are counted as input tokens, not output tokens, and max_tokens applies only to what Claude generates. The API response contains only the continuation; the prefix is not echoed back, so you prepend it yourself before parsing.
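The message layout described here can be sketched as two small helpers. The function names `build_prefill_messages` and `reassemble` are hypothetical, and the sketch makes no API call. It also guards against a prefill that ends in whitespace, which the API rejects for a final assistant turn.

```python
def build_prefill_messages(user_text: str, prefill: str = "{") -> list[dict]:
    """Return a messages list whose last entry is the assistant prefill."""
    if not prefill:
        raise ValueError("prefill must be a non-empty string")
    if prefill != prefill.rstrip():
        # A final assistant turn must not end with trailing whitespace.
        raise ValueError("prefill must not end with whitespace")
    return [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": prefill},
    ]


def reassemble(prefill: str, continuation: str) -> str:
    """Join the prefill with the text the model generated after it."""
    return prefill + continuation


messages = build_prefill_messages("Extract the name and age: Ada, 36.")
print(messages[-1])
print(reassemble("{", '"name": "Ada", "age": 36}'))
```

Keeping the prefill in one variable and passing it to both helpers avoids the mismatch where the request uses one prefix and the parser prepends another.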

When to use it: Use prefill when you have a known JSON schema and want to cut down on preambles, malformed structures, and drift from the expected shape. A bare { only fixes the first character; naming the fields in the prompt (or prefilling up to the first key) is what keeps keys and types on track. It's particularly useful for extraction tasks ("extract these 5 fields"), API response mocking, or batch processing where consistency matters. Avoid prefill for open-ended generation or when the output structure is genuinely unknown.
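Because prefill steers the format without checking it, a lightweight validation step after parsing is a reasonable companion. The following is a minimal sketch, assuming a flat record. `EXPECTED_FIELDS` and `check_record` are hypothetical names and the schema is illustrative.

```python
# Hypothetical flat schema for the extraction example: field name -> Python type.
EXPECTED_FIELDS = {"name": str, "age": int, "occupation": str}


def check_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for key, expected_type in EXPECTED_FIELDS.items():
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], expected_type):
            problems.append(f"wrong type for {key}: {type(record[key]).__name__}")
    for key in record:
        if key not in EXPECTED_FIELDS:
            problems.append(f"unexpected key: {key}")
    return problems


print(check_record({"name": "John Smith", "age": "34", "city": "Portland"}))
```

For nested or conditional schemas, a dedicated validator library would be a better fit than this hand-rolled check.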

Request code

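```python
import json

import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

PREFILL = "{"  # opening token(s) of the assistant's reply

user_message = (
    "Extract the name, age, and occupation from this text as a JSON object "
    "with the keys name, age, and occupation: "
    "John Smith is a 34-year-old software engineer living in Portland."
)

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=256,
    messages=[
        {"role": "user", "content": user_message},
        # Prefill: the final assistant turn holds the start of the reply.
        {"role": "assistant", "content": PREFILL},
    ],
)

# A truncated reply cannot be parsed, so check before calling json.loads.
if response.stop_reason == "max_tokens":
    raise RuntimeError("Output was truncated; increase max_tokens.")

# The response holds only the continuation, so prepend the prefill.
continuation = "".join(
    block.text for block in response.content if block.type == "text"
)
parsed = json.loads(PREFILL + continuation)

print(f"Name: {parsed.get('name')}")
print(f"Age: {parsed.get('age')}")
print(f"Occupation: {parsed.get('occupation')}")
```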

Authentication

Set your Anthropic API key before instantiation:

```bash
export ANTHROPIC_API_KEY='sk-ant-...'
```

Or pass it explicitly in code:

```python
client = anthropic.Anthropic(api_key='sk-ant-...')
```

The client reads the environment variable at instantiation time, not at request time.

Response shape

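Field	Example value
`id`	msg_unique_identifier
`type`	message
`role`	assistant
`content`	[{"type": "text", "text": "..."}] (list of content blocks)
`model`	claude-opus-4-6
`stop_reason`	end_turn | max_tokens | stop_sequence
`stop_sequence`	null, or the stop string that was matched
`usage`	{"input_tokens": ..., "output_tokens": ...}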

Field guide

content[0].text

The text Claude generated after your prefill. It does not include the prefilled prefix, so always prepend your prefill string to get the complete output.

stop_reason

Critical field: 'end_turn' means Claude finished naturally, 'max_tokens' means the output was cut off, and 'stop_sequence' means one of your stop sequences was hit. If stop_reason is 'max_tokens' and you're parsing JSON, the output is almost certainly truncated and will not parse: increase max_tokens and retry.

usage.output_tokens

Counts only the tokens Claude generated, not the prefilled portion. The prefill is part of the request, so its tokens are counted in usage.input_tokens and billed at the input rate.

Setup trap

The prefill must be the last entry in messages, with role set to 'assistant'. Its content can be a plain string (the simplest form) or a list of text content blocks. It must not end with whitespace: the API rejects a final assistant turn that has trailing whitespace. Do not try to prefill through the system prompt: system is a top-level request parameter, not a message role, and it supplies instructions instead of starting the response. The most common error is placing the prefill in the wrong message slot or using the wrong role.

Cost

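Prefilled content is billed as ordinary input tokens, the same as the rest of the request. It is not billed as a cache write, and it does not trigger prompt caching by itself. A one-character prefix such as { adds about one input token per request, so its direct cost is negligible; the savings come from the output side, by skipping preambles and avoiding retries on malformed responses. Prompt caching is a separate, opt-in feature that you enable with cache_control on a shared prompt prefix, for example long extraction instructions in the system prompt. With the default 5-minute cache, cache writes cost 25% more than standard input tokens and cache reads cost about 10% of the standard input rate. Caching matches on the prefix up to the cache breakpoint, so the per-request user message can change as long as the cached portion comes first, and very short prefixes fall below the minimum cacheable length.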
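To see where the money goes for a single request, you can estimate cost from the two counters in response.usage. This is a sketch only: the helper `estimate_cost` is hypothetical, and the prices below are placeholders, not actual rates for any model.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float,
                  output_price_per_mtok: float) -> float:
    """Estimate request cost from token counts and per-million-token prices."""
    total = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return total / 1_000_000


# Placeholder numbers: in practice, read the counts from response.usage
# and the prices from the current pricing page.
cost = estimate_cost(
    input_tokens=60,    # prompt plus the prefill token(s)
    output_tokens=25,   # only what the model generated
    input_price_per_mtok=5.0,
    output_price_per_mtok=25.0,
)
print(f"{cost:.6f}")
```

The prefill shows up on the input side of this sum, which is why a one-token prefix has almost no effect on the total.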

Rate limits

Prefill doesn't bypass rate limits. It can trim output tokens per request by skipping preambles and explanations around the JSON, which helps slightly if you are close to an output-tokens-per-minute limit, but the JSON body itself costs the same number of tokens either way. Treat it as a small throughput gain, not a fix for rate limiting.

Common gotcha

The prefilled content is not automatically included in response.content[0].text: you receive only what Claude generated after your prefix. If you print response.content[0].text directly, you'll see `"name": "John Smith", "age": 34, ...}` without the opening brace. You must concatenate the prefix yourself: `full_json = '{' + response.content[0].text`. Developers often forget this and get parsing errors.

Error recovery

json.JSONDecodeError

Your prefill plus Claude's continuation did not form valid JSON. Check that the prefill is a valid start of a JSON document (usually just '{' or '[') and that you prepended it before parsing. If stop_reason is 'max_tokens', the JSON was truncated: increase max_tokens and regenerate. Check response.stop_reason before parsing. Extra text after the closing brace also raises this error.
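One way to tolerate trailing text after the JSON is to decode only the first JSON value and ignore the rest. The sketch below uses `json.JSONDecoder.raw_decode` from the standard library; the helper name `parse_leading_json` is hypothetical. Truncated output still raises `json.JSONDecodeError`, which is the behavior you want.

```python
import json


def parse_leading_json(prefill: str, continuation: str):
    """Parse the first JSON value in prefill + continuation, ignoring any tail."""
    text = prefill + continuation
    # raw_decode returns the parsed value and the index where it stopped.
    value, _end = json.JSONDecoder().raw_decode(text)
    return value


reply = '"name": "John Smith"}\n\nLet me know if you need anything else.'
print(parse_leading_json("{", reply))
```

This is a fallback, not a substitute for checking stop_reason: a reply cut off mid-object cannot be recovered this way.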

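stop_reason is 'max_tokens' (no exception is raised)

Claude ran out of token budget before completing the JSON. This is a successful response, not an anthropic.APIError, so you have to detect it by checking response.stop_reason. Increase the max_tokens parameter in the create() call. Start with max_tokens=2048 and adjust based on expected output size.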
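A truncated reply arrives as a normal response with stop_reason set to 'max_tokens', so recovery is a matter of checking that field and retrying with a larger budget. The sketch below assumes a hypothetical `send` callable that performs the request for a given max_tokens and returns the continuation text and stop reason. A stand-in function replaces the real API call so the example runs offline.

```python
import json
from typing import Callable, Tuple


def extract_with_retry(send: Callable[[int], Tuple[str, str]],
                       prefill: str = "{",
                       start_tokens: int = 256,
                       ceiling: int = 4096):
    """Retry with a doubled max_tokens while the reply is truncated."""
    max_tokens = start_tokens
    while max_tokens <= ceiling:
        continuation, stop_reason = send(max_tokens)
        if stop_reason != "max_tokens":
            return json.loads(prefill + continuation)
        max_tokens *= 2  # truncated: double the budget and try again
    raise RuntimeError(f"output still truncated within the {ceiling}-token ceiling")


# Stand-in for a real API call: truncates until the budget reaches 512.
def fake_send(max_tokens: int) -> Tuple[str, str]:
    if max_tokens < 512:
        return '"name": "John Sm', "max_tokens"
    return '"name": "John Smith"}', "end_turn"


print(extract_with_retry(fake_send))
```

The ceiling keeps a runaway response from doubling the budget indefinitely; pick it from the largest record you expect.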

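anthropic.AuthenticationError (HTTP 401, a subclass of anthropic.APIStatusError)

API key is missing or invalid. Verify that the ANTHROPIC_API_KEY environment variable is set correctly in the process that runs your code. To debug without leaking the full key, print only the first 10 characters of client.api_key.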
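A quick way to check the key without exposing it is to report whether the variable is set and show only a short prefix. This is a small sketch using the standard library; the helper name `describe_api_key` is hypothetical.

```python
import os


def describe_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Describe the key in env_var without revealing more than 10 characters."""
    key = os.environ.get(env_var, "")
    if not key:
        return f"{env_var} is not set"
    return f"{env_var} is set ({key[:10]}..., {len(key)} chars)"


print(describe_api_key())
```

Run it in the same shell or process as your application, since an exported variable in one terminal is not visible in another.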

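Request rejected or unexpected content in response

The prefill message format was incorrect. Ensure the assistant message is the last entry in messages, that its content is a string such as "{" (or a list of text blocks) and not a bare dict, and that it does not end with whitespace. The messages list should be [user_message, assistant_prefill_message].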

Experienced dev note

Prefill is a low-effort reliability technique that many devs overlook. Instead of elaborate prompt engineering to 'force' JSON, a one-character prefill makes Claude start in the right format, which cuts down on retry loops. Keep the parse check, because prefill does not guarantee valid output. The hidden value: when you chain multiple Claude calls (e.g., extract → transform → enrich), prefilling the structure of each downstream call reduces format drift and wasted tokens. Prefill also coexists with prompt caching, but it does not trigger it: caching applies to a long shared prefix that you mark with cache_control, and cache reads on that prefix are billed at roughly 10% of the standard input rate. Use both in production APIs where latency and cost matter.

Check your understanding

You're building a batch document extractor that processes 1,000 invoices per hour. Each invoice should return name, amount, and date as JSON. You use prefill to steer the output format. What happens to your token costs and response latency if you send the same prefill (an opening brace) 1,000 times in a single hour? What should you consider to optimize further?

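Answer hint

Repeating the same one-character prefill does not trigger prompt caching: caching is opt-in through cache_control, and a single brace is far below the minimum cacheable length. Each request pays about one extra input token for the prefill, and output tokens stay roughly constant, minus any preamble the prefill suppresses. To optimize further, put the shared extraction instructions in a long, cacheable prefix marked with cache_control so that repeat requests within the cache lifetime (5 minutes by default) read it at a discount. If you can wait for asynchronous results, also consider the Message Batches API: it accepts many independent requests in one submission at a reduced price, and each request can still carry its own prefill.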
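For the invoice scenario, a batch submission is a list of independent requests, each with its own custom_id and its own prefill. The sketch below only builds that list and makes no API call. The helper `build_batch_requests`, the invoice IDs, and the instruction text are hypothetical; cache_control on the system block only takes effect when the shared prefix is long enough to be cacheable.

```python
def build_batch_requests(invoices: dict[str, str], instructions: str,
                         model: str, prefill: str = "{") -> list[dict]:
    """Build one batch entry per invoice, each ending in an assistant prefill."""
    requests = []
    for invoice_id, invoice_text in invoices.items():
        requests.append({
            "custom_id": invoice_id,
            "params": {
                "model": model,
                "max_tokens": 256,
                # Shared instructions, marked as a cacheable prefix.
                "system": [{
                    "type": "text",
                    "text": instructions,
                    "cache_control": {"type": "ephemeral"},
                }],
                "messages": [
                    {"role": "user", "content": invoice_text},
                    {"role": "assistant", "content": prefill},
                ],
            },
        })
    return requests


batch = build_batch_requests(
    {"inv-0001": "Invoice for Acme, $120.00, 2024-03-01",
     "inv-0002": "Invoice for Globex, $75.50, 2024-03-02"},
    instructions="Return a JSON object with the keys name, amount, and date.",
    model="claude-opus-4-6",  # model ID taken from this lesson
)
# With the SDK installed: client.messages.batches.create(requests=batch)
print(len(batch), batch[0]["custom_id"])
```

Results come back asynchronously and are matched to inputs by custom_id, so each continuation still needs its prefill prepended before parsing.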

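VERSION Prefill via a final assistant message is a Messages API feature that predates Claude 3.5, and it works the same way in anthropic 0.94.x. The Messages API (not the legacy Text Completions API) is required. Support for a prefilled final assistant turn can vary by model, so confirm it in the documentation for the model you use. Do not confuse prefill with the system prompt: system is a top-level parameter, not a message role, and it cannot stand in for a prefill.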
