How to design Pydantic schemas for extraction
Quick answer
Use Pydantic models to define structured schemas for the data fields you want to extract. Pass a model as response_model to the instructor client when calling chat.completions.create to get typed, validated extraction results from model responses.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0" instructor pydantic
Setup
Install the required packages and set your OpenAI API key as an environment variable.
pip install openai instructor pydantic
Step by step
Define a Pydantic model representing the data you want to extract, then use instructor.from_openai to create a client wrapping the OpenAI SDK. Call chat.completions.create with your schema as response_model to get structured extraction.
import os

from pydantic import BaseModel
from openai import OpenAI
import instructor

# Define Pydantic schema for extraction
class User(BaseModel):
    name: str
    age: int

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Wrap with instructor for structured extraction
extractor = instructor.from_openai(client)

# Prompt with extraction request
prompt = "Extract: John is 30 years old"

# Call chat completion with response_model
response = extractor.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_model=User,
)

# Access typed extraction result (response is already a User instance)
user = response
print(f"Name: {user.name}, Age: {user.age}")
Output
Name: John, Age: 30
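Much of schema design comes down to what the model is told about each field. The sketch below is an illustrative variant of the User model, not part of the example above: the field descriptions and the 0 to 150 age bounds are assumptions chosen for demonstration. It assumes Pydantic v2 (model_json_schema) and runs offline, so you can inspect the JSON schema that the model generates before making any API call.

```python
from pydantic import BaseModel, Field


class User(BaseModel):
    """A person mentioned in the input text."""

    # Descriptions and bounds are illustrative assumptions, not required by instructor.
    name: str = Field(description="Full name exactly as written in the text")
    age: int = Field(ge=0, le=150, description="Age in years")


# Pydantic v2 API: build the JSON schema derived from the model definition.
schema = User.model_json_schema()

print(schema["required"])            # field names that must be present
print(schema["properties"]["age"])   # type, bounds, and description for age
```

Descriptions and constraints live next to the field they document, so the schema stays the single source of truth for both validation and the hints available to the model.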
Common variations
- Use async calls by wrapping an AsyncOpenAI client with instructor.from_openai and awaiting extractor.chat.completions.create(...).
- Switch to Anthropic by using instructor.from_anthropic with an Anthropic client.
- Define nested or optional fields in Pydantic models for complex extraction tasks.
- Use different OpenAI models such as gpt-4o or gpt-4o-mini depending on cost and accuracy needs.
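Nested and optional fields can be sketched as follows. The Address and Person models and the sample JSON are hypothetical, made up for illustration, and the code assumes Pydantic v2 (model_validate_json). It validates a hand-written payload locally instead of calling an API, which is a cheap way to check that a schema accepts the shape you expect.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Address(BaseModel):
    # Hypothetical nested model: one address mentioned in the text.
    city: str
    country: Optional[str] = None  # optional: the text may not mention it


class Person(BaseModel):
    name: str
    # Optional with a lower bound; None means "not stated in the text".
    age: Optional[int] = Field(default=None, ge=0)
    # A nested list that defaults to empty when no address is found.
    addresses: List[Address] = Field(default_factory=list)


# Hand-written sample payload standing in for a model response.
raw = '{"name": "Ana", "addresses": [{"city": "Lisbon"}]}'
person = Person.model_validate_json(raw)

print(person.age)                 # None, since no age was provided
print(person.addresses[0].city)   # Lisbon
```

Giving optional fields an explicit default of None lets the extraction succeed when the source text omits a value, rather than pushing the model to invent one.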
Troubleshooting
- If extraction fields are missing or incorrect, ensure your prompt clearly instructs the AI to provide the data in the expected format.
- Verify that your Pydantic schema matches the expected response structure exactly to avoid validation errors.
- Check your API key and environment variables if you get authentication errors.
- Use the max_tokens parameter to allow enough tokens for the model to complete the extraction.
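When a response does not match the schema, Pydantic raises a ValidationError that names the failing field. The sketch below reproduces that offline with hand-written payloads. The try_parse helper is hypothetical, introduced only for this illustration, and the code assumes Pydantic v2 (model_validate and its error type names).

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    age: int


def try_parse(payload: dict):
    """Hypothetical helper: return (user, problems) instead of raising."""
    try:
        return User.model_validate(payload), []
    except ValidationError as exc:
        # Each error carries the field location and a machine-readable type.
        problems = [
            (".".join(str(part) for part in err["loc"]), err["type"])
            for err in exc.errors()
        ]
        return None, problems


ok, ok_problems = try_parse({"name": "John", "age": 30})
bad, bad_problems = try_parse({"name": "John", "age": "thirty"})
missing, missing_problems = try_parse({"name": "John"})

print(bad_problems)      # the age field failed integer parsing
print(missing_problems)  # the age field was absent
```

Reading the field location and error type usually shows whether the fix belongs in the prompt (the model omitted a value) or in the schema (the field is stricter than the data allows).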
Key Takeaways
- Define clear Pydantic models to represent the exact data you want extracted.
- Use instructor with response_model to get typed, validated AI extraction results.
- Adjust prompts and schema carefully to ensure accurate and complete extraction.
- Async and Anthropic clients are supported for flexible integration.
- Always set your API key securely via environment variables.