How-to · Beginner · 3 min read

How to batch extract with Instructor

Quick answer
Use the instructor library to wrap an OpenAI client, then loop over your inputs and call client.chat.completions.create with a response_model for each one. Instructor validates every response against a Pydantic model, so you can extract structured data from many texts in a single script.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" instructor pydantic (quote the version specifier so the shell does not treat > as a redirect)

Setup

Install the required packages and set your OpenAI API key as an environment variable.

  • Install packages: pip install openai instructor pydantic
  • Set environment variable in your shell: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)

Step by step

Define a Pydantic model for the structured data you want to extract, then use instructor.from_openai to create a client wrapping the OpenAI SDK. Loop over your batch of texts and call client.chat.completions.create with response_model to extract data for each input.

python
import os
from openai import OpenAI
import instructor
from pydantic import BaseModel

# Define the structured data model
class User(BaseModel):
    name: str
    age: int

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Wrap with Instructor client
inst_client = instructor.from_openai(client)

# Batch of texts to extract from
texts = [
    "Extract: John is 30 years old",
    "Extract: Alice is 25 years old",
    "Extract: Bob is 40 years old"
]

# Extract data in batch
results = []
for text in texts:
    response = inst_client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=User,
        messages=[{"role": "user", "content": text}]
    )
    results.append(response)

# Print extracted data
for res in results:
    print(f"Name: {res.name}, Age: {res.age}")
output
Name: John, Age: 30
Name: Alice, Age: 25
Name: Bob, Age: 40
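The response_model here is an ordinary Pydantic model, and instructor validates each LLM response against it. You can sanity-check the schema locally without any API call; note that Pydantic coerces compatible input types, such as a numeric string to an int:

```python
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

# Pydantic coerces compatible types, e.g. the string "30" to the int 30
user = User.model_validate({"name": "John", "age": "30"})
print(user)  # name='John' age=30
```

(model_validate is the Pydantic v2 spelling; on v1 the equivalent is parse_obj.)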

Common variations

You can run extraction asynchronously by wrapping AsyncOpenAI with instructor.from_openai and awaiting client.chat.completions.create inside an async function. You can also switch to a different OpenAI model, such as gpt-4o for higher accuracy, or swap in other Pydantic models for different extraction schemas.

python
import asyncio

from openai import AsyncOpenAI

# Async usage requires an AsyncOpenAI client wrapped by Instructor
async_inst_client = instructor.from_openai(AsyncOpenAI())

async def batch_extract_async(texts):
    results = []
    for text in texts:
        response = await async_inst_client.chat.completions.create(
            model="gpt-4o",
            response_model=User,
            messages=[{"role": "user", "content": text}]
        )
        results.append(response)
    return results

texts = [
    "Extract: John is 30 years old",
    "Extract: Alice is 25 years old",
    "Extract: Bob is 40 years old"
]

results = asyncio.run(batch_extract_async(texts))
for res in results:
    print(f"Name: {res.name}, Age: {res.age}")
output
Name: John, Age: 30
Name: Alice, Age: 25
Name: Bob, Age: 40
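Awaiting each request in turn still serializes the round trips; asyncio.gather issues them concurrently for real throughput gains. A minimal sketch of the pattern, with a stand-in coroutine in place of the real await async_inst_client.chat.completions.create(...) call so it runs without an API key:

```python
import asyncio

# Stand-in for one awaited create(...) call; swap the body for the real API call
async def extract_one(text: str) -> dict:
    await asyncio.sleep(0)  # simulates network I/O
    name, _, rest = text.replace("Extract: ", "", 1).partition(" is ")
    return {"name": name, "age": int(rest.split()[0])}

async def batch_extract_concurrent(texts):
    # gather schedules all coroutines at once instead of awaiting one by one
    return await asyncio.gather(*(extract_one(t) for t in texts))

texts = [
    "Extract: John is 30 years old",
    "Extract: Alice is 25 years old",
]
results = asyncio.run(batch_extract_concurrent(texts))
print(results)
```

In practice, cap concurrency (for example with an asyncio.Semaphore) so a large batch does not immediately trip rate limits.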

Troubleshooting

  • If extraction results are missing fields or incorrect, verify your Pydantic model matches the expected output format.
  • If you get authentication errors, ensure OPENAI_API_KEY is set correctly in your environment.
  • For rate limits, batch your requests with delays or use a higher quota plan.
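For rate limits specifically, a small client-side exponential backoff helper is often enough. A sketch where flaky_call stands in for one create(...) call (the OpenAI SDK also retries automatically; see the max_retries argument to OpenAI(...)):

```python
import time

def with_backoff(fn, max_attempts=4, base_delay=1.0):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * 2 ** attempt)

# Demo: a stand-in call that fails twice before succeeding
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = with_backoff(flaky_call, base_delay=0.01)
print(result)  # ok, after two retries
```

In real code, catch only the rate-limit exception (openai.RateLimitError) rather than a bare Exception, so genuine bugs still fail fast.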

Key takeaways

  • Use instructor.from_openai with a Pydantic model to batch extract structured data efficiently.
  • Loop over your input texts and call client.chat.completions.create with response_model for each extraction.
  • Async extraction is supported by wrapping AsyncOpenAI with instructor.from_openai and awaiting create, which enables better throughput.
  • Ensure your Pydantic model matches the expected extraction schema to avoid parsing errors.
  • Always set your API key securely via environment variables.
Verified 2026-04 · gpt-4o-mini, gpt-4o