How to extract structured data with Instructor
Quick answer
Use the instructor Python library to define a pydantic.BaseModel describing the schema you want, then call client.chat.completions.create with response_model=YourModel to get a validated, typed object back instead of raw text. The examples here use OpenAI's gpt-4o-mini, but other chat models work the same way.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0" instructor pydantic
Setup
Install the required packages and set your OpenAI API key as an environment variable.
- Install packages:
pip install openai instructor pydantic
- Set environment variable in your shell:
export OPENAI_API_KEY='your_api_key'
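Before creating a client, it can help to fail fast with a readable message when the key is missing. The helper below is a minimal sketch using only the standard library; require_api_key is a hypothetical name introduced for illustration and is not part of openai or instructor.

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise a readable error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the script.")
    return key

# Demonstration with a stand-in mapping instead of the real environment
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

In a real script you would call require_api_key() with no arguments so that it reads os.environ.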
Step by step
Define a pydantic.BaseModel for the structured data you want to extract, then use instructor.from_openai to create a client wrapping OpenAI. Call client.chat.completions.create with your model and input text to get typed structured output.
import os

from pydantic import BaseModel
import instructor
from openai import OpenAI

# Define your structured data model
class User(BaseModel):
    name: str
    age: int

# Initialize OpenAI client
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Wrap OpenAI client with Instructor
client = instructor.from_openai(openai_client)

# Input text to extract from
input_text = "Extract: John is 30 years old"

# Call chat completion with response_model
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=User,
    messages=[{"role": "user", "content": input_text}]
)

# Access structured data (the response is already a User instance)
user = response
print(f"Name: {user.name}, Age: {user.age}")
Output:
Name: John, Age: 30
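To make the role of response_model more concrete, the sketch below shows, offline and without any API call, two things a Pydantic model provides: a JSON schema describing the expected fields, and validation of a JSON payload into a typed object. It illustrates the idea under Pydantic v2 and is not a description of Instructor's internals; raw_reply is a made-up stand-in for a model reply.

```python
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

# JSON schema derived from the model (Pydantic v2 API)
schema = User.model_json_schema()
print(schema["required"])

# Hypothetical JSON payload standing in for a model reply
raw_reply = '{"name": "John", "age": 30}'
user = User.model_validate_json(raw_reply)
print(f"Name: {user.name}, Age: {user.age}")
```

Because age is declared as int, a payload with a non-numeric age would raise a validation error rather than silently produce a bad object.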
Common variations
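You can use different OpenAI models such as gpt-4o or gpt-4o-mini depending on your accuracy and cost needs. Instructor also supports Anthropic models via instructor.from_anthropic. For asynchronous usage, wrap an AsyncOpenAI client with instructor.from_openai and await the same chat.completions.create(...) call inside an async function.
import asyncio
from openai import AsyncOpenAI

# Wrap the async OpenAI client; it reads OPENAI_API_KEY from the environment
async_client = instructor.from_openai(AsyncOpenAI())

async def async_extract():
    # With an async client, create() returns an awaitable
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=User,
        messages=[{"role": "user", "content": "Extract: Alice is 25 years old"}]
    )
    print(f"Name: {response.name}, Age: {response.age}")

asyncio.run(async_extract())
Output: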
Name: Alice, Age: 25
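For the Anthropic variation, a rough sketch might look like the following. It assumes the anthropic package is installed and ANTHROPIC_API_KEY is set; extract_user_with_anthropic is a hypothetical helper name, and the model name is left as a parameter because the right choice depends on your account. Defining the function makes no network call; only invoking it does.

```python
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

def extract_user_with_anthropic(text: str, model: str) -> User:
    """Extract a User from `text` using an Anthropic model (illustrative helper)."""
    # Imported lazily so this module loads even without the optional packages
    import instructor
    from anthropic import Anthropic  # reads ANTHROPIC_API_KEY from the environment

    client = instructor.from_anthropic(Anthropic())
    return client.messages.create(
        model=model,
        max_tokens=1024,  # Anthropic's Messages API requires max_tokens
        response_model=User,
        messages=[{"role": "user", "content": text}],
    )
```

Note that the Anthropic-wrapped client exposes messages.create rather than chat.completions.create, mirroring Anthropic's own SDK.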
Troubleshooting
- If you get validation errors, ensure your pydantic model matches the expected data format.
- If the API returns unexpected results, try adding more context or examples in the prompt.
- Make sure your OPENAI_API_KEY environment variable is set correctly.
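When diagnosing validation errors, it can be useful to reproduce them offline. The sketch below, assuming Pydantic v2, adds illustrative constraints to the User model and reports which field failed; try_parse and the age bounds are assumptions made for this example. Instructor's create call also accepts a max_retries argument that re-asks the model when validation fails, which may reduce how often these errors reach your code.

```python
from pydantic import BaseModel, Field, ValidationError

class User(BaseModel):
    name: str = Field(description="The person's full name")
    # Illustrative bounds; adjust to whatever your data allows
    age: int = Field(ge=0, le=150, description="Age in years")

def try_parse(raw_json: str):
    """Return (user, None) on success or (None, list of messages) on failure."""
    try:
        return User.model_validate_json(raw_json), None
    except ValidationError as exc:
        messages = [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
        return None, messages

ok, _ = try_parse('{"name": "John", "age": 30}')
bad, problems = try_parse('{"name": "John", "age": -5}')
print(ok)
print(problems)
```

Field descriptions are included in the generated JSON schema, so they can also serve as extra context for the model.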
Key Takeaways
- Define your data schema with pydantic.BaseModel for typed extraction.
- Use instructor.from_openai to wrap the OpenAI client for structured responses.
- Pass response_model=YourModel to chat.completions.create for automatic parsing.
- Instructor supports async clients and multiple models, so the same pattern carries across setups.
- Validate your model and prompt to improve extraction accuracy.