How to extract from long documents with Instructor

Quick answer
Wrap an OpenAI client with the instructor Python library, then pass the document text in a chat completion request together with an extraction prompt and a Pydantic response_model. Instructor parses the model's reply and validates it against that model. Passing the full text in one request works as long as the document fits in the model's context window.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" instructor pydantic (quote the version specifier so the shell does not treat > as a redirect)

Setup

Install the required packages and set your OpenAI API key as an environment variable.

  • Install packages: pip install openai instructor pydantic
  • Set environment variable in your shell: export OPENAI_API_KEY='your_api_key'
bash
pip install "openai>=1.0" instructor pydantic
export OPENAI_API_KEY='your_api_key'
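
A missing or blank key usually surfaces later as an authentication error from the API, which can be harder to read than a local check. Here is a minimal sketch of an early check, assuming a hypothetical helper named require_api_key (it is not part of openai or instructor):

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key, or raise a clear error if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the examples.")
    return key

# Typical use at the top of a script:
#   api_key = require_api_key()
```

The env parameter exists only so the check can be exercised with a plain dict instead of the real environment.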

Step by step

This example shows how to extract structured data from a long document using instructor with OpenAI's gpt-4o-mini model. Define a Pydantic model for the expected output, then call client.chat.completions.create with response_model to parse the AI's response.

python
import os
from openai import OpenAI
import instructor
from pydantic import BaseModel

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Wrap OpenAI client with Instructor
inst_client = instructor.from_openai(client)

# Define a Pydantic model for extraction
class DocumentInfo(BaseModel):
    title: str
    author: str
    summary: str

# Long document text (example)
long_document = '''\
Title: The Future of AI
Author: Jane Doe

Artificial intelligence (AI) is rapidly evolving and impacting many industries. This document explores key trends and future directions in AI research and applications.
'''

# Create chat completion with response_model
response = inst_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Extract the title, author, and a brief summary from this document:\n\n{long_document}"}],
    response_model=DocumentInfo
)

# The return value is already a validated DocumentInfo instance
extracted = response
print(f"Title: {extracted.title}")
print(f"Author: {extracted.author}")
print(f"Summary: {extracted.summary}")
output
Title: The Future of AI
Author: Jane Doe
Summary: Artificial intelligence (AI) is rapidly evolving and impacting many industries, with key trends and future directions explored.
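
To see what validating against a response_model amounts to without making an API call, the sketch below runs the same Pydantic model over two hand-written JSON strings. It assumes Pydantic v2 (model_validate_json), and the JSON strings are made-up stand-ins, not real model output. It illustrates the validation step only and does not show how Instructor is implemented internally.

```python
from pydantic import BaseModel, ValidationError

class DocumentInfo(BaseModel):
    title: str
    author: str
    summary: str

# Hand-written JSON standing in for a model reply (not real API output)
good_reply = '{"title": "The Future of AI", "author": "Jane Doe", "summary": "Trends in AI."}'
bad_reply = '{"title": "The Future of AI"}'  # author and summary are missing

# A complete reply parses into a typed object
info = DocumentInfo.model_validate_json(good_reply)
print(info.author)

# An incomplete reply raises ValidationError listing the missing fields
missing = []
try:
    DocumentInfo.model_validate_json(bad_reply)
except ValidationError as exc:
    missing = sorted(err["loc"][0] for err in exc.errors())
print("Missing fields:", missing)
```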

Common variations

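To run the extraction asynchronously, wrap an AsyncOpenAI client instead of the synchronous one and await the same chat.completions.create call. You can also switch to another OpenAI model such as gpt-4o for higher quality, or define a different Pydantic model to extract other fields.

python
import asyncio
from openai import AsyncOpenAI

# Async calls need an async client; this reuses os, instructor,
# DocumentInfo, and long_document from the previous example
async_client = instructor.from_openai(AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]))

async def async_extract():
    # Same method name as the sync client, but awaited
    response = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Extract title, author, and summary from:\n\n{long_document}"}],
        response_model=DocumentInfo
    )
    print(f"Async Title: {response.title}")

asyncio.run(async_extract())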
output
Async Title: The Future of AI
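
The examples above send the whole document in a single request. When a document is too long for the model's context window, one common variation is to split it into overlapping chunks and run the extraction on each chunk. The sketch below shows only the splitting step. The helper chunk_text and its max_chars and overlap parameters are hypothetical names chosen for illustration, and it counts characters rather than tokens, which is a simplification.

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list:
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

# Tiny values make the overlap easy to see
chunks = chunk_text("abcdefghij", max_chars=4, overlap=1)
print(chunks)
```

Each chunk can then be passed as the document text in its own request. How to merge the per-chunk results (for example, keeping the first non-empty title) depends on the fields being extracted.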

Troubleshooting

  • If you get validation errors, check that the field names and types in your response_model match what the prompt asks the model to return.
  • If the extraction is incomplete, try refining your prompt for clarity.
  • Check your OPENAI_API_KEY environment variable is set correctly.
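
For recurring validation errors, one option is to make the model stricter and let Instructor re-ask on failure through the max_retries argument of create. A minimal sketch follows, assuming Pydantic v2. The blank-summary rule and the helper extract_with_retries are illustrative choices, not part of the original example, and the function is defined here without being called because it needs a live API key.

```python
from pydantic import BaseModel, ValidationError, field_validator

class DocumentInfo(BaseModel):
    title: str
    author: str
    summary: str

    @field_validator("summary")
    @classmethod
    def summary_not_blank(cls, value: str) -> str:
        # Reject replies where the summary is empty or only whitespace
        if not value.strip():
            raise ValueError("summary must not be blank")
        return value

def extract_with_retries(inst_client, text: str) -> DocumentInfo:
    """inst_client is assumed to be an Instructor-wrapped OpenAI client."""
    return inst_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Extract the title, author, and a brief summary:\n\n{text}"}],
        response_model=DocumentInfo,
        max_retries=2,  # Instructor's retry budget when validation fails
    )

# The validator can be checked locally, without any API call
try:
    DocumentInfo(title="T", author="A", summary="   ")
    blank_rejected = False
except ValidationError:
    blank_rejected = True
print(blank_rejected)
```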

Key Takeaways

  • Use instructor.from_openai() to wrap an OpenAI client for structured extraction.
  • Define a Pydantic response_model to parse AI responses into typed data.
  • Pass the full document text in the prompt, as long as it fits within the model's context window.
  • Async calls (through an AsyncOpenAI client) and other OpenAI models such as gpt-4o are supported.
  • Refine prompts and validate models to improve extraction accuracy.
Verified 2026-04 · gpt-4o-mini, gpt-4o