
Nested schemas: complex structured output

What you will learn

Define hierarchical Pydantic models to extract deeply nested structured data from unstructured text.

Why this matters

Real-world data rarely fits flat schemas: you need to extract hierarchical relationships (documents with sections, people with addresses, orders with line items). Nested schemas let you model these relationships and get strongly-typed, validated output directly from LLMs.

Skip if: you only need simple flat extraction (a single layer of fields), or the LLM must be free to choose an arbitrary nesting depth at runtime. For truly variable hierarchies, use a flat schema with back-references or reconsider whether extraction is the right approach.
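The flat-schema-with-back-references alternative can be sketched roughly as follows. This is a minimal illustration assuming Pydantic v2; the Section and Outline models, the parent_id field, and the children_by_parent helper are hypothetical names, and the payload is hand-written rather than real LLM output.

```python
from collections import defaultdict
from typing import Optional

from pydantic import BaseModel, Field


class Section(BaseModel):
    # Hypothetical flat record: hierarchy is expressed by parent_id, not by nesting.
    id: int = Field(description="Unique section id")
    parent_id: Optional[int] = Field(
        default=None, description="Id of the parent section, or null for top level"
    )
    title: str = Field(description="Section title")


class Outline(BaseModel):
    sections: list[Section] = Field(description="All sections, in document order")


def children_by_parent(outline: Outline) -> dict[Optional[int], list[str]]:
    """Group section titles under their parent id so the tree can be rebuilt."""
    grouped: dict[Optional[int], list[str]] = defaultdict(list)
    for section in outline.sections:
        grouped[section.parent_id].append(section.title)
    return dict(grouped)


# Hand-written data standing in for an LLM response.
outline = Outline.model_validate(
    {
        "sections": [
            {"id": 1, "parent_id": None, "title": "Intro"},
            {"id": 2, "parent_id": None, "title": "Methods"},
            {"id": 3, "parent_id": 2, "title": "Sampling"},
        ]
    }
)
tree = children_by_parent(outline)
print(tree)
```

The schema stays one level deep no matter how deep the document goes; the depth lives in the data instead of the type.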

Explanation

A nested schema is a Pydantic model that contains other Pydantic models as field types, creating a hierarchy of structured data. LLMs can understand and respect this hierarchy, extracting data into the correct nested containers.

Mechanically: you define an inner model (e.g., Address), then include it as a field type in an outer model (e.g., Person). When you pass this composite model to llm.with_structured_output(), the LLM receives the full JSON schema and generates output that matches the nested structure. The output parser validates against the entire hierarchy and raises errors if any field is malformed.
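To see what "the full JSON schema" looks like, you can ask Pydantic for it directly. The sketch below assumes Pydantic v2 and uses trimmed-down versions of the lesson's models; what a given LangChain integration actually sends to the provider is derived from this schema but may be reshaped, so treat it as an approximation.

```python
from pydantic import BaseModel, Field


class Address(BaseModel):
    city: str = Field(description="City name")


class Person(BaseModel):
    name: str = Field(description="Full name")
    address: Address = Field(description="Home address")


# The JSON schema Pydantic generates for the composite model.
schema = Person.model_json_schema()

# Nested models are emitted once under "$defs" and referenced from the parent.
nested_names = sorted(schema["$defs"])
top_level_fields = sorted(schema["properties"])
print(nested_names, top_level_fields)

# Validation walks the whole hierarchy: the inner dict becomes an Address instance.
person = Person.model_validate(
    {"name": "John Smith", "address": {"city": "Springfield"}}
)
print(type(person.address).__name__)
```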

Use nested schemas when your domain naturally has parent-child relationships and the nesting depth is known in advance. They're common in real-world extraction: business documents with metadata and content, user profiles with contact information, or technical specs with sections and subsections.

Analogy

Think of it like a filing cabinet. A flat schema is a single sheet of paper with labeled blanks. A nested schema is a file folder that contains subfolders, which contain their own labeled documents. The LLM understands the folder structure and puts information in the right place.

Code

python

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    zip_code: str = Field(description="ZIP or postal code")

class PhoneNumber(BaseModel):
    country_code: str = Field(description="Country code, e.g. +1")
    number: str = Field(description="Phone number without country code")

class Person(BaseModel):
    name: str = Field(description="Full name")
    age: int = Field(description="Age in years")
    address: Address = Field(description="Home address")
    phone: PhoneNumber = Field(description="Contact phone number")
    hobbies: list[str] = Field(description="List of hobbies")

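# Requires the OPENAI_API_KEY environment variable to be set.
llm = ChatOpenAI(model="gpt-4o-mini")
# Bind the composite schema; invoke() will now return a Person instance.
structured_llm = llm.with_structured_output(Person)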

text = """
John Smith is 34 years old. He lives at 742 Evergreen Terrace,
Springfield, IL 62701. You can reach him at +1 555-0147.
He enjoys gardening, photography, and woodworking.
"""

result = structured_llm.invoke(text)
print(f"Name: {result.name}")
print(f"City: {result.address.city}")
print(f"Country code: {result.phone.country_code}")
print(f"Hobbies: {', '.join(result.hobbies)}")
print(f"\nFull object:\n{result}")

Output

Name: John Smith
City: Springfield
Country code: +1
Hobbies: gardening, photography, woodworking

Full object:
name='John Smith' age=34 address=Address(street='742 Evergreen Terrace', city='Springfield', zip_code='62701') phone=PhoneNumber(country_code='+1', number='555-0147') hobbies=['gardening', 'photography', 'woodworking']

What just happened?

The code defined three Pydantic models: two leaf models (<code>Address</code>, <code>PhoneNumber</code>) and one composite model (<code>Person</code>) that nests the other two. The LLM received the full schema including the nested structure. It parsed the unstructured text and returned a <code>Person</code> instance with fully populated nested <code>address</code> and <code>phone</code> fields, all validated against their respective models. You accessed nested values using dot notation.
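Because the result is an ordinary Pydantic object, the nested structure also survives serialization. A minimal sketch, assuming Pydantic v2, with the object built by hand instead of coming from structured_llm.invoke():

```python
from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    zip_code: str


class Person(BaseModel):
    name: str
    address: Address


# Built by hand here; in the lesson this object comes from the LLM call.
result = Person(
    name="John Smith",
    address=Address(street="742 Evergreen Terrace", city="Springfield", zip_code="62701"),
)

as_dict = result.model_dump()  # nested models become nested dicts
as_json = result.model_dump_json()  # JSON string, e.g. for storage or logging
restored = Person.model_validate_json(as_json)  # round-trips back to typed objects

print(as_dict["address"]["city"])
```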

Common gotcha

The most common mistake is nesting something other than a Pydantic model. A plain dict field produces a schema with no named properties, so the LLM has nothing to guide it and the contents can't be validated field by field; an arbitrary non-Pydantic class typically can't be turned into a JSON schema at all. Always use BaseModel subclasses with proper type hints, and add Field descriptions to guide extraction. Also, deeply nested structures (3+ levels) can confuse some models: if you go too deep, flatten intermediate levels or split into separate extraction calls.
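The difference between nesting a dict and nesting a model shows up directly in the generated schema. The comparison below is a sketch assuming Pydantic v2; LoosePerson and TypedPerson are hypothetical names used only for contrast.

```python
from pydantic import BaseModel, Field


class Address(BaseModel):
    city: str = Field(description="City name")


class LoosePerson(BaseModel):
    # Hypothetical anti-example: a plain dict carries no field information.
    address: dict


class TypedPerson(BaseModel):
    address: Address


loose_schema = LoosePerson.model_json_schema()
typed_schema = TypedPerson.model_json_schema()

# The dict field is just "an object": no named properties for the LLM to fill in.
loose_has_fields = "properties" in loose_schema["properties"]["address"]

# The nested model contributes a full definition, including field names.
typed_fields = sorted(typed_schema["$defs"]["Address"]["properties"])

print(loose_has_fields, typed_fields)
```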

Error recovery

ValidationError (nested field is wrong type)

The LLM generated a field value that doesn't match the nested model's type hint. Fix it by adding more detailed Field descriptions to guide the LLM, e.g., <code>Field(description="Must be exactly in format +CC NNNNNNNNNN")</code>.
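When a nested field fails, the error itself tells you where. The sketch below assumes Pydantic v2 and validates a hand-written bad payload directly; depending on the structured-output method in use, LangChain may surface the failure wrapped in its own exception type, so the try/except shape here is an assumption about your setup.

```python
from pydantic import BaseModel, ValidationError


class Address(BaseModel):
    city: str
    zip_code: str


class Person(BaseModel):
    name: str
    address: Address


# Hand-written payload standing in for a bad LLM response: zip_code is null.
bad_payload = {
    "name": "John Smith",
    "address": {"city": "Springfield", "zip_code": None},
}

error_paths = []
try:
    Person.model_validate(bad_payload)
except ValidationError as exc:
    for err in exc.errors():
        # "loc" is the path to the offending field, outermost model first.
        path = ".".join(str(part) for part in err["loc"])
        error_paths.append(path)
        print(path, "->", err["msg"])
```

The path points at the exact nested field whose description or type hint needs attention.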

ValidationError (required field missing)

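A required nested field was left empty. Either make the field optional by giving it a default, as in <code>Optional[...] = None</code> (in Pydantic v2 the <code>Optional[...]</code> annotation alone still leaves the field required), or improve the prompt to clarify that the field must always be extracted.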
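The distinction between "may be null" and "may be omitted" is easy to check locally. A minimal sketch, assuming Pydantic v2; StrictPerson and LenientPerson are hypothetical names, and the payload is hand-written.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class PhoneNumber(BaseModel):
    country_code: str
    number: str


class StrictPerson(BaseModel):
    name: str
    phone: Optional[PhoneNumber]  # may be None, but must still be supplied


class LenientPerson(BaseModel):
    name: str
    phone: Optional[PhoneNumber] = None  # may be omitted entirely


payload = {"name": "John Smith"}  # no phone number was found in the text

lenient = LenientPerson.model_validate(payload)

try:
    StrictPerson.model_validate(payload)
    strict_failed = False
except ValidationError:
    strict_failed = True

print(lenient.phone, strict_failed)
```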

AttributeError when accessing nested field

You're reading an attribute of a nested field that is None, which can happen when the field is declared optional and the LLM didn't populate it. Check the result object first: <code>if result.address: print(result.address.city)</code>. Or catch the AttributeError and provide a default.
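A small accessor keeps the None check in one place. This is a sketch assuming Pydantic v2 and an address field declared optional; city_or_default is a hypothetical helper, not part of any library.

```python
from typing import Optional

from pydantic import BaseModel


class Address(BaseModel):
    city: str


class Person(BaseModel):
    name: str
    address: Optional[Address] = None


def city_or_default(person: Person, default: str = "unknown") -> str:
    """Return the nested city, or a default when no address was extracted."""
    if person.address is None:
        return default
    return person.address.city


with_address = Person(name="John Smith", address=Address(city="Springfield"))
without_address = Person(name="Jane Doe")

print(city_or_default(with_address))
print(city_or_default(without_address))
```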

Experienced dev note

Nested schemas force you to think about your domain model first, which is actually a feature: schema design largely determines extraction quality. Spend time on the Pydantic structure and Field descriptions before calling the LLM, because a bad schema will cause the LLM to fail or produce invalid output no matter how clever your prompt is. Also, deeply nested structures (3+ levels) often perform worse with smaller models; use a more capable model such as gpt-4o if you go deep, or refactor into multiple extraction passes.

Check your understanding

You're extracting data from documents where some documents have a company address and some have a personal address (with different fields). How would you model this with nested schemas: would you use optional nested fields, separate fields for each type, or something else? What trade-off are you making?

Answer hint

A correct answer recognizes that you'd need either optional nested fields of different types, a union type, or separate root schemas. The key insight is understanding that Pydantic's type system enforces the schema at extraction time, so the structure must handle all cases: the LLM can't 'choose' a schema dynamically.
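One way the union option could look is a discriminated union, where a tag field selects the nested model. This is a sketch assuming Pydantic v2; the model names and the kind field are hypothetical, the payload is hand-written, and whether a particular LLM provider accepts the resulting schema is something to verify for your setup.

```python
from typing import Literal, Union

from pydantic import BaseModel, Field


class CompanyAddress(BaseModel):
    kind: Literal["company"] = "company"
    company_name: str
    street: str


class PersonalAddress(BaseModel):
    kind: Literal["personal"] = "personal"
    resident_name: str
    street: str


class Document(BaseModel):
    title: str
    # The "kind" tag tells Pydantic which nested model to validate against.
    address: Union[CompanyAddress, PersonalAddress] = Field(discriminator="kind")


# Hand-written data standing in for an LLM response.
doc = Document.model_validate(
    {
        "title": "Invoice",
        "address": {
            "kind": "company",
            "company_name": "Acme Corp",
            "street": "1 Main St",
        },
    }
)
print(type(doc.address).__name__)
```

The trade-off: one root schema covers both cases, at the cost of a larger schema and an extra tag the LLM has to get right.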

VERSION This lesson targets langchain >= 1.0.0, where with_structured_output() accepts a Pydantic model directly. Older code commonly used an output parser such as PydanticOutputParser or manual schema construction instead. Ensure you're on langchain-core >= 0.3.0 for reliable nested model support.

Learn how to validate and refine extraction results by chaining the structured output through a verification step that checks for inconsistencies or missing cross-references between nested fields.
