How to use Instructor for data extraction
Quick answer
Use the instructor Python library to wrap the OpenAI client for structured data extraction: define a Pydantic model and pass it as response_model to client.chat.completions.create. The call then returns a validated instance of that model, so fields can be extracted from unstructured text with minimal code.
PREREQUISITES
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0" instructor pydantic
Setup
Install the required packages and set your OpenAI API key in the environment.
- Install packages:
  pip install openai instructor pydantic
- Set environment variable:
  export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
pip install openai instructor pydantic
Output:
Collecting openai
Collecting instructor
Collecting pydantic
Successfully installed openai instructor pydantic
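Before making any API call, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch using only the standard library; the helper name missing_keys is hypothetical and not part of instructor or openai.

```python
import os
from typing import List, Mapping


def missing_keys(names: List[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # OPENAI_API_KEY is the variable this guide sets in the step above.
    missing = missing_keys(["OPENAI_API_KEY"])
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required keys are set.")
```

Passing the environment as a parameter keeps the check easy to try with a plain dictionary instead of the real environment.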
Step by step
Define a Pydantic model for the data you want to extract, then use instructor.from_openai to create a client that wraps the OpenAI client. Call chat.completions.create on the wrapped client with your model as response_model and pass the user prompt in messages.
import os

from openai import OpenAI
import instructor
from pydantic import BaseModel


# Define the data model for extraction
class User(BaseModel):
    name: str
    age: int


# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Wrap with Instructor for structured extraction
instructor_client = instructor.from_openai(client)

# User prompt with data to extract
messages = [{"role": "user", "content": "Extract: John is 30 years old"}]

# Call chat completion with response_model
user = instructor_client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=User,
    messages=messages,
)

print(f"Name: {user.name}, Age: {user.age}")
Output:
Name: John, Age: 30
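The value returned above is a User instance because the model's reply is checked against the Pydantic model. The sketch below shows only that validation step in isolation, with no API call. It assumes Pydantic v2 (model_validate_json), and the JSON strings are hand-written stand-ins for model output, not real responses.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    age: int


# A well-formed payload parses into a typed object.
user = User.model_validate_json('{"name": "John", "age": 30}')
print(f"Name: {user.name}, Age: {user.age}")

# A payload with the wrong type is rejected instead of being passed through.
try:
    User.model_validate_json('{"name": "John", "age": "thirty"}')
    rejected = False
except ValidationError:
    rejected = True
print(f"Rejected bad payload: {rejected}")
```

This is why the field types in the model matter: a mismatch surfaces as a validation error rather than as silently wrong data.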
Common variations
You can use a different model such as gpt-4o for higher accuracy, or claude-3-5-sonnet-20241022 from Anthropic by wrapping the Anthropic client with instructor.from_anthropic. Async usage is also supported: wrap an async client (anthropic.AsyncAnthropic here) and await the create call inside an async function. Note that Anthropic requires max_tokens on every request.
import asyncio
import os

import anthropic
import instructor
from pydantic import BaseModel


class User(BaseModel):
    name: str
    age: int


async def main():
    # Async Anthropic client, so that the create call can be awaited
    client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    instructor_client = instructor.from_anthropic(client)

    messages = [{"role": "user", "content": "Extract: Alice is 25 years old"}]

    user = await instructor_client.chat.completions.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,  # required by the Anthropic API
        response_model=User,
        messages=messages,
    )
    print(f"Name: {user.name}, Age: {user.age}")


asyncio.run(main())
Output:
Name: Alice, Age: 25
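Async clients mostly pay off when several texts are extracted at once. The following is a rough sketch of that pattern, assuming an async Instructor client created with instructor.from_openai(AsyncOpenAI()). The helper names extract_users and main are hypothetical, and the client is passed in as a parameter so the helper itself does not depend on any particular provider.

```python
import asyncio
from typing import Any, List

from pydantic import BaseModel


class User(BaseModel):
    name: str
    age: int


async def extract_users(client: Any, model: str, texts: List[str]) -> List[User]:
    """Send one extraction request per text and await them together."""
    tasks = [
        client.chat.completions.create(
            model=model,
            response_model=User,
            messages=[{"role": "user", "content": f"Extract: {text}"}],
        )
        for text in texts
    ]
    # gather preserves the order of the input texts.
    return list(await asyncio.gather(*tasks))


async def main() -> None:
    # Imported here so extract_users can be reused without these packages.
    import instructor
    from openai import AsyncOpenAI

    # AsyncOpenAI reads OPENAI_API_KEY from the environment by default.
    client = instructor.from_openai(AsyncOpenAI())
    texts = ["John is 30 years old", "Alice is 25 years old"]
    for user in await extract_users(client, "gpt-4o-mini", texts):
        print(f"Name: {user.name}, Age: {user.age}")
```

Run it against the API with asyncio.run(main()). Results depend on the model, so no output is shown here.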
Troubleshooting
- If extraction fields are None or missing, ensure your Pydantic model matches the expected data types exactly.
- If you get API errors, verify that your OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable is set correctly.
- Use smaller prompts or increase max_tokens if the model truncates output.
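For the first point, one option is to declare fields that may be absent from the text as optional, so a missing value is an explicit None rather than a validation failure. This is a minimal sketch assuming Pydantic v2. The class name UserLoose and the description strings are hypothetical; the descriptions end up in the model's JSON schema, which may help the model fill the fields correctly.

```python
from typing import Optional

from pydantic import BaseModel, Field


class UserLoose(BaseModel):  # hypothetical variant of the User model
    name: str = Field(description="The person's name as written in the text")
    # Optional with a default: validation succeeds when no age is given.
    age: Optional[int] = Field(default=None, description="Age in years, if stated")


# Hand-written dictionaries stand in for model output here.
partial = UserLoose.model_validate({"name": "John"})
complete = UserLoose.model_validate({"name": "Alice", "age": 25})

print(partial.age is None)
print(complete.age)
```

Keep a field required whenever a missing value should count as a failed extraction.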
Key Takeaways
- Use instructor with Pydantic models to extract structured data from text easily.
- Wrap the OpenAI or Anthropic client with instructor.from_openai or instructor.from_anthropic respectively.
- Pass your Pydantic model as response_model in chat.completions.create for automatic parsing.
- Async calls and different models are supported for flexibility and performance.
- Always verify environment variables and model names to avoid runtime errors.