Nested object extraction
Why this matters
Real-world data is rarely flat: invoices have line items, documents have sections with subsections, APIs return nested responses. Extracting nested structures with Claude eliminates the need for multiple API calls or fragile regex parsing, reducing latency and cost while improving accuracy.
Explanation
What it does: Passing a JSON schema with nested objects as a tool's input_schema tells Claude the exact shape to return. When the model calls the tool, its arguments arrive as an already-parsed JSON object that follows your structure, so there is no free text to parse. Standard tool use steers the model toward the schema but does not by itself guarantee conformance, so check the result before relying on it.
How it works: You define the nested fields as a raw JSON schema (using properties, plus $defs for reusable sub-objects), or generate that schema from a Pydantic model. The API processes your prompt and returns a tool_use block whose input holds the entire object hierarchy. The schema tells the model how deep the nesting goes and which child fields belong to which parent, which makes extra or missing levels far less likely than with free-form prompting.
When to use it: Use this when extracting structured data from unstructured documents (contracts, medical records, customer feedback), converting prose into databases, or enriching messy input. The nested pattern shines when your extraction has 2+ levels of hierarchy and accuracy matters more than speed.
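The relationship between properties and $defs mentioned above can be sketched with the standard library alone. The schema below is hypothetical (an order with two addresses, not the invoice used later), and inline_refs is an illustrative helper, not part of any SDK. It only handles local '#/$defs/<name>' references and would recurse forever on self-referencing definitions.

```python
import json

# Hypothetical schema for illustration: "address" is defined once under $defs
# and referenced from two properties with $ref.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "billing_address": {"$ref": "#/$defs/address"},
        "shipping_address": {"$ref": "#/$defs/address"},
    },
    "required": ["order_id"],
    "$defs": {
        "address": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "country": {"type": "string"},
            },
            "required": ["city"],
        }
    },
}


def inline_refs(node, defs):
    """Return a copy of `node` with local '#/$defs/<name>' references expanded."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/$defs/"):
            name = ref[len("#/$defs/"):]
            # Recurse so definitions that reference other definitions expand too.
            return inline_refs(defs[name], defs)
        # Drop the $defs table itself; everything else is copied recursively.
        return {k: inline_refs(v, defs) for k, v in node.items() if k != "$defs"}
    if isinstance(node, list):
        return [inline_refs(item, defs) for item in node]
    return node


expanded = inline_refs(ORDER_SCHEMA, ORDER_SCHEMA["$defs"])
print(json.dumps(expanded["properties"]["shipping_address"], indent=2))
```

Both forms describe the same nested shape. $defs simply avoids repeating a sub-object that appears in more than one place.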
Request code
from anthropic import Anthropic
import json

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()

text_to_extract = """
Invoice #2026-001 from Acme Corp dated April 15, 2026.
Line items:
- 5 units of Widget A at $10 each, shipped to New York
- 2 units of Widget B at $25 each, expedited to California
Notes: Customer is a returning client with 10% loyalty discount applied.
"""

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": f"Extract the invoice data from this text and return it as JSON:\n\n{text_to_extract}"
        }
    ],
    tools=[
        {
            "name": "extract_invoice",
            "description": "Extracts structured invoice data including line items and metadata",
            "input_schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {
                        "type": "string",
                        "description": "Invoice ID"
                    },
                    "vendor": {
                        "type": "string",
                        "description": "Vendor name"
                    },
                    "date": {
                        "type": "string",
                        "description": "Invoice date"
                    },
                    # First nested level: an array of line-item objects.
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "product_name": {
                                    "type": "string"
                                },
                                "quantity": {
                                    "type": "integer"
                                },
                                "unit_price": {
                                    "type": "number"
                                },
                                "destination": {
                                    "type": "string"
                                },
                                "shipping_method": {
                                    "type": "string",
                                    "enum": ["standard", "expedited"]
                                }
                            },
                            "required": ["product_name", "quantity", "unit_price"]
                        }
                    },
                    # Optional nested object: not listed in the top-level "required".
                    "metadata": {
                        "type": "object",
                        "properties": {
                            "is_returning_customer": {
                                "type": "boolean"
                            },
                            "discount_applied": {
                                "type": "number",
                                "description": "Discount percentage"
                            },
                            "notes": {
                                "type": "string",
                                "description": "Additional notes"
                            }
                        }
                    }
                },
                "required": ["invoice_number", "vendor", "date", "line_items"]
            }
        }
    ],
    # Force a call to extract_invoice so the first content block is the tool_use block.
    tool_choice={"type": "tool", "name": "extract_invoice"},
)

if response.content[0].type == "tool_use":
    # .input is already a Python dict matching the schema's nesting.
    extracted_data = response.content[0].input
    print("Extracted Invoice:")
    print(json.dumps(extracted_data, indent=2))
else:
    print("No tool call in response")
    print(response.content[0].text)
Authentication
Set your Anthropic API key as an environment variable before running: export ANTHROPIC_API_KEY='sk-ant-...'. The Anthropic client reads this automatically on instantiation.
Response shape
| Field | Description |
|---|---|
| content | list of content blocks |
| content[0].type | tool_use (when the model calls your tool) |
| content[0].input | parsed nested object shaped by your schema |
| content[0].input.invoice_number | string |
| content[0].input.line_items | array of objects with product_name, quantity, unit_price, destination, shipping_method |
| content[0].input.metadata | object with is_returning_customer, discount_applied, notes |
| stop_reason | tool_use (indicates the model called the tool) |
Field guide
content[0].input The extracted data, and the part you care about. It arrives as a Python dict, not a JSON string, so no json.loads is needed.
stop_reason 'tool_use' when the model called your tool. 'end_turn' means the model replied in text without calling the tool; 'max_tokens' means the response was truncated, so the tool input may be missing or incomplete. Treat either as a failed extraction.
line_items Array of nested objects, one per line item. The schema keeps related fields together (each quantity with its own unit_price), so values are much less likely to be mixed across items.
metadata Optional nested object that is easy to overlook: it holds context stated in the document (customer status, discounts, notes) that enriches the primary data without cluttering the top level of the schema.
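As a rough illustration of consuming the nested result, the sketch below computes an invoice total from line_items and the optional metadata object. The sample dict is hand-written to match the schema in the request code, not real model output, and invoice_total is a hypothetical helper. It assumes discount_applied is a percentage such as 10, as the schema description states, not a fraction such as 0.1.

```python
def invoice_total(invoice: dict) -> float:
    """Sum quantity * unit_price over line_items, then apply the optional discount."""
    subtotal = sum(
        item["quantity"] * item["unit_price"] for item in invoice["line_items"]
    )
    # metadata and discount_applied are optional in the schema, so default to 0.
    discount_pct = invoice.get("metadata", {}).get("discount_applied", 0)
    return subtotal * (100 - discount_pct) / 100


# Hand-written sample shaped like the schema (an assumption, not model output).
sample_invoice = {
    "invoice_number": "2026-001",
    "vendor": "Acme Corp",
    "date": "April 15, 2026",
    "line_items": [
        {"product_name": "Widget A", "quantity": 5, "unit_price": 10.0},
        {"product_name": "Widget B", "quantity": 2, "unit_price": 25.0},
    ],
    "metadata": {"is_returning_customer": True, "discount_applied": 10},
}

print(invoice_total(sample_invoice))  # 5*10 + 2*25 = 100, minus 10% -> 90.0
```

Because each line item carries its own quantity and unit_price, the total is a single pass over the array with no text parsing.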
Setup trap
The tools parameter is required even though this looks like structured output rather than tool use: the schema lives in the tool's input_schema, so without a declared tool Claude never sees it. Also, declare required at every nesting level for the fields your downstream code depends on. Any field left out of required is optional, and the model may skip it, which breaks parsing code that assumes the field is present.
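One way to catch skipped fields before they break downstream code is to walk the schema and the returned dict together. The following is a minimal sketch, assuming a schema that uses only type, properties, items, and required. missing_required is a hypothetical helper, and it is no substitute for a full JSON Schema validator.

```python
def missing_required(schema: dict, data, path: str = "$") -> list[str]:
    """Return the paths of required keys that are absent at any nesting level."""
    missing = []
    if schema.get("type") == "object" and isinstance(data, dict):
        for key in schema.get("required", []):
            if key not in data:
                missing.append(f"{path}.{key}")
        # Descend only into properties that are actually present.
        for key, subschema in schema.get("properties", {}).items():
            if key in data:
                missing.extend(missing_required(subschema, data[key], f"{path}.{key}"))
    elif schema.get("type") == "array" and isinstance(data, list):
        for index, item in enumerate(data):
            missing.extend(
                missing_required(schema.get("items", {}), item, f"{path}[{index}]")
            )
    return missing


# Trimmed version of the invoice schema, for illustration only.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                },
                "required": ["product_name", "quantity", "unit_price"],
            },
        },
    },
    "required": ["invoice_number", "line_items"],
}

# Hand-written data with one gap: the second item has no unit_price.
data = {
    "invoice_number": "2026-001",
    "line_items": [
        {"product_name": "Widget A", "quantity": 5, "unit_price": 10.0},
        {"product_name": "Widget B", "quantity": 2},
    ],
}

print(missing_required(schema, data))  # ['$.line_items[1].unit_price']
```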
Cost
Nested extraction is billed like any other API call: input tokens plus output tokens. The saving comes from making one call instead of many. For an invoice with 10 line items, one nested extraction (~500 output tokens) beats 10 separate calls (~1,000 output tokens minimum), and each separate call also re-sends the document and schema as input. Treat these figures as rough estimates.
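The comparison can be worked through as arithmetic. In the sketch below, the output token counts come from the estimate above, while the 800 input tokens per call and both prices are hypothetical placeholders chosen only to make the calculation concrete. They are not actual rates, so substitute current pricing for your model.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost in dollars for one call, given prices per million tokens."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Hypothetical placeholder prices (dollars per million tokens), NOT real rates.
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

# Assumption: document + schema come to about 800 input tokens per call.
DOC_AND_SCHEMA_TOKENS = 800

# One nested call: the document is sent once, ~500 output tokens.
nested = call_cost(DOC_AND_SCHEMA_TOKENS, 500, INPUT_PRICE, OUTPUT_PRICE)

# Ten flat calls: the document is re-sent each time, ~100 output tokens each.
flat = 10 * call_cost(DOC_AND_SCHEMA_TOKENS, 100, INPUT_PRICE, OUTPUT_PRICE)

print(f"nested: ${nested:.4f}, flat: ${flat:.4f}")
```

Under these assumptions the repeated input tokens account for most of the difference, alongside the extra output tokens.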
Rate limits
Not a concern for this specific operation: an extraction call counts against rate limits like any other call. However, if you're extracting from thousands of documents and don't need results immediately, submit them through the Batch API, which processes requests asynchronously for 50% cost savings.
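A batch submission is essentially a list of per-document request bodies. The sketch below only builds that list and makes no network call. build_batch_requests is a hypothetical helper, and the custom_id / params layout reflects the Message Batches API as I understand it, so verify it against the current API reference before use.

```python
def build_batch_requests(documents: dict[str, str], tool: dict, model: str) -> list[dict]:
    """Build one batch entry per document, each forcing the extraction tool."""
    requests = []
    for doc_id, text in documents.items():
        requests.append({
            # custom_id lets you match each result back to its source document.
            "custom_id": doc_id,
            "params": {
                "model": model,
                "max_tokens": 1024,
                "tools": [tool],
                "tool_choice": {"type": "tool", "name": tool["name"]},
                "messages": [
                    {
                        "role": "user",
                        "content": f"Extract the invoice data from this text:\n\n{text}",
                    }
                ],
            },
        })
    return requests


# Abbreviated tool definition and documents, for illustration only.
extract_tool = {
    "name": "extract_invoice",
    "description": "Extracts structured invoice data",
    "input_schema": {"type": "object", "properties": {}},
}
docs = {
    "invoice-001": "Invoice #2026-001 from Acme Corp ...",
    "invoice-002": "Invoice #2026-002 from Acme Corp ...",
}

batch_requests = build_batch_requests(docs, extract_tool, "claude-opus-4-6")
print(len(batch_requests))

# Submitting would then look roughly like this (assumed SDK call, left commented out):
# batch = client.messages.batches.create(requests=batch_requests)
```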
Common gotcha
Developers often read response.content[0].text expecting text, but a tool_use block has no text field: it has an input field containing the parsed object. Reading the wrong field raises AttributeError in the Python SDK, or yields None if you access the raw JSON with .get().
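A defensive pattern is to look up the tool_use block by type and name instead of assuming a position or a field. In this sketch, SimpleNamespace objects stand in for the SDK's content blocks so the example runs without an API call. first_tool_input is a hypothetical helper, not an SDK function.

```python
from types import SimpleNamespace


def first_tool_input(content_blocks, tool_name: str):
    """Return the input dict of the first matching tool_use block, or None."""
    for block in content_blocks:
        if getattr(block, "type", None) == "tool_use" and getattr(block, "name", None) == tool_name:
            return block.input
    return None


# Stand-ins for SDK content blocks: a text block followed by a tool_use block.
fake_content = [
    SimpleNamespace(type="text", text="Here is the extracted invoice."),
    SimpleNamespace(
        type="tool_use",
        name="extract_invoice",
        input={"invoice_number": "2026-001"},
    ),
]

extracted = first_tool_input(fake_content, "extract_invoice")
print(extracted)  # {'invoice_number': '2026-001'}
```

With a real response you would pass response.content. A None result means the model did not call the tool.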
Error recovery
KeyError on content[0].input
ValidationError in nested properties
TypeError: object is not subscriptable
Experienced dev note
Nested extraction beats prompt engineering. Instead of writing elaborate instructions to handle complex hierarchies, let the schema enforce structure: the model follows a schema more reliably than it follows prose instructions. Also, keep nested fields optional unless your downstream code truly needs them. That gives Claude room to omit a value when the input text is ambiguous instead of inventing one. Finally, use enum for categorical nested fields (like shipping_method above). It restricts the model to the values you list, which reduces invented categories at negligible extra cost.
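Since an enum steers the model but may not strictly bind it, a cheap post-check on categorical fields can still be worthwhile. The sketch below is a minimal example using the shipping_method values from the request schema. off_enum_items is a hypothetical helper and the sample items are hand-written.

```python
SHIPPING_METHODS = ["standard", "expedited"]


def off_enum_items(line_items: list[dict], allowed=SHIPPING_METHODS) -> list[tuple]:
    """Return (index, value) for items whose shipping_method is present but not allowed."""
    return [
        (index, item["shipping_method"])
        for index, item in enumerate(line_items)
        # A missing shipping_method is fine: the field is optional in the schema.
        if "shipping_method" in item and item["shipping_method"] not in allowed
    ]


# Hand-written sample: one valid value, one invented category, one omitted field.
items = [
    {"product_name": "Widget A", "shipping_method": "expedited"},
    {"product_name": "Widget B", "shipping_method": "overnight"},
    {"product_name": "Widget C"},
]

print(off_enum_items(items))  # [(1, 'overnight')]
```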
Check your understanding
You're extracting customer orders where each order contains nested items, and some items have nested discounts applied at the item level. Your current schema requires every discount object to have a reason_code. The API returns a valid response, but some items are missing discount objects entirely. Is this a schema problem, a model problem, or neither, and what's your next move?
Answer hint
It's neither. The schema requires reason_code inside a discount object, but it allows the discount object itself to be omitted (discount is not listed in the item's required array), so the model correctly returned items without discounts. This is working as designed. Your next move: decide whether a discount is always expected (add it to required) or whether a missing discount is acceptable. The schema gave you the behavior you specified. The question is whether you designed it to match your business logic.
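The distinction in the hint can be made concrete with a schema fragment. The sketch below is hypothetical: the field names sku, percent, and items are invented for illustration, and only reason_code comes from the question. It shows that requiring reason_code inside discount does not make discount itself required.

```python
# Hypothetical item schema: reason_code is required *inside* discount,
# but discount is not listed in the item's own "required" array.
ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "sku": {"type": "string"},
        "discount": {
            "type": "object",
            "properties": {
                "reason_code": {"type": "string"},
                "percent": {"type": "number"},
            },
            "required": ["reason_code"],
        },
    },
    "required": ["sku"],
}

# If the business rule says every item must carry a discount object,
# the fix is one entry in the item's "required" array.
STRICT_ITEM_SCHEMA = {**ITEM_SCHEMA, "required": ["sku", "discount"]}


def items_without_discount(order: dict) -> list[str]:
    """Return the sku of every item that has no discount object."""
    return [item["sku"] for item in order["items"] if item.get("discount") is None]


# Hand-written order: valid under ITEM_SCHEMA, not under STRICT_ITEM_SCHEMA.
order = {
    "items": [
        {"sku": "A-1", "discount": {"reason_code": "LOYALTY", "percent": 10}},
        {"sku": "B-2"},
    ]
}

print(items_without_discount(order))  # ['B-2']
```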