How to extract from long documents with Instructor

Quick answer

Wrap an OpenAI client with the instructor Python library, then pass the document text in a chat completion request together with a prompt and a Pydantic response_model. Instructor parses the model's reply and validates it against that model. Passing the full text in one request works as long as the document fits within the model's context window.

PREREQUISITES

Python 3.8+
OpenAI API key (free tier works)
pip install "openai>=1.0" instructor pydantic

Setup

Install the required packages and set your OpenAI API key as an environment variable.

Install packages: pip install openai instructor pydantic
Set environment variable in your shell: export OPENAI_API_KEY='your_api_key'

bash

pip install openai instructor pydantic
export OPENAI_API_KEY='your_api_key'

Step by step

This example shows how to extract structured data from a long document using instructor with OpenAI's gpt-4o-mini model. Define a Pydantic model for the expected output, then call client.chat.completions.create with response_model to parse the AI's response.

python

import os
from openai import OpenAI
import instructor
from pydantic import BaseModel

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Wrap OpenAI client with Instructor
inst_client = instructor.from_openai(client)

# Define a Pydantic model for extraction
class DocumentInfo(BaseModel):
    title: str
    author: str
    summary: str

# Long document text (example)
long_document = '''\
Title: The Future of AI
Author: Jane Doe

Artificial intelligence (AI) is rapidly evolving and impacting many industries. This document explores key trends and future directions in AI research and applications.
'''

# Create chat completion with response_model
response = inst_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Extract the title, author, and a brief summary from this document:\n\n{long_document}"}],
    response_model=DocumentInfo
)

# With response_model set, the call returns a validated DocumentInfo instance
extracted = response
print(f"Title: {extracted.title}")
print(f"Author: {extracted.author}")
print(f"Summary: {extracted.summary}")

output

Title: The Future of AI
Author: Jane Doe
Summary: Artificial intelligence (AI) is rapidly evolving and impacting many industries, with key trends and future directions explored.
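The example above sends the whole document in a single request, which only works while the text fits in the model's context window. For longer inputs, one possible approach is to split the text into overlapping chunks and run the same extraction on each chunk. The sketch below is an assumption for illustration, not part of Instructor: the helper name split_into_chunks and its max_chars and overlap parameters are hypothetical, and character counts are only a rough stand-in for tokens.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int = 8000, overlap: int = 200) -> List[str]:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the chunks.
    (Hypothetical helper; not an Instructor or OpenAI API.)
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks: List[str] = []
    step = max_chars - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window already reaches the end of the text
        if start + max_chars >= len(text):
            break
        start += step
    return chunks
```

Each chunk could then be passed in place of long_document in the call shown earlier. How to merge the per-chunk results (for example, keeping the first non-empty title) depends on the fields being extracted and is left to the caller.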

Common variations

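You can make asynchronous calls by wrapping an AsyncOpenAI client with Instructor and awaiting the same create method. You can also switch to other OpenAI models such as gpt-4o for higher quality, or use different Pydantic models to extract other structured data.

python

import asyncio
from openai import AsyncOpenAI

# Async extraction needs its own client: wrap AsyncOpenAI instead of OpenAI
async_client = instructor.from_openai(AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]))

async def async_extract():
    # The method is still create(); it is awaitable on the async client
    response = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Extract title, author, and summary from:\n\n{long_document}"}],
        response_model=DocumentInfo
    )
    print(f"Async Title: {response.title}")

asyncio.run(async_extract())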

output

Async Title: The Future of AI
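As an illustration of using a different Pydantic model, here is a minimal sketch of an alternative schema. The class name KeyFindings and its fields are hypothetical, chosen only for this example. The field descriptions become part of the schema that Instructor sends to the model, which may help guide the extraction.

```python
from typing import List

from pydantic import BaseModel, Field


class KeyFindings(BaseModel):
    """Hypothetical alternative response_model; field names are illustrative."""

    topic: str = Field(description="Main subject of the document")
    key_points: List[str] = Field(description="Short findings, one per item")


# Local check, without any API call, that the schema accepts the expected shape
sample = KeyFindings(topic="AI trends", key_points=["AI is evolving rapidly"])
print(sample.topic, len(sample.key_points))
```

To use it, pass response_model=KeyFindings in the create call and adjust the prompt to ask for the topic and key points.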

Troubleshooting

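If you get validation errors, check that the fields and types in your response_model match what the document actually contains, and make a field optional when the information may be missing.
If the extraction is incomplete, make the prompt more specific about which fields to extract. Very long inputs can also exceed the model's context window.
Check that your OPENAI_API_KEY environment variable is set correctly.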
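For the API key check in particular, a small guard at startup can fail early with a clear message instead of an authentication error later. This is a minimal sketch using only the standard library. The function name require_api_key is hypothetical, and the env parameter exists only so the check can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None,
                    name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing or blank.

    (Hypothetical helper; not part of the openai or instructor packages.)
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set or is empty")
    return value
```

Calling require_api_key() before creating the client, and passing its return value as api_key, keeps the failure close to its cause.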

Key Takeaways

Use instructor.from_openai() to wrap an OpenAI client for structured extraction.
Define a Pydantic response_model to parse AI responses into typed data.
Pass the full document text in the prompt, provided it fits within the model's context window.
Async calls and different OpenAI models are supported for flexibility.
Refine prompts and validate models to improve extraction accuracy.

Verified 2026-04 · gpt-4o-mini, gpt-4o
