How to extract tables with Instructor
Quick answer
Use the Instructor library with an OpenAI client to extract tables: define a pydantic model that represents the table structure and pass it as response_model to client.chat.completions.create. The pydantic model then parses the AI response into structured table data automatically.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0" instructor pydantic
Setup
Install the required packages and set your OpenAI API key as an environment variable.
- Install packages: pip install openai instructor pydantic
- Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
Step by step
Define a pydantic model representing the table schema, then use instructor.from_openai to create a client that wraps the OpenAI client. Call chat.completions.create with response_model set to your table model to extract tables from text.
import os
from openai import OpenAI
import instructor
from pydantic import BaseModel
from typing import List
# Define a pydantic model for a table row
class TableRow(BaseModel):
    item: str
    quantity: int
    price: float

# Define a model for the entire table
class Table(BaseModel):
    rows: List[TableRow]
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Wrap OpenAI client with Instructor
inst_client = instructor.from_openai(client)
# Input text containing a table
text = """
Here is the sales data:
| Item | Quantity | Price |
|------------|----------|--------|
| Apples | 10 | 0.5 |
| Bananas | 5 | 0.3 |
| Cherries | 20 | 1.5 |
"""
# Create chat completion with response_model to extract table
response = inst_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Extract the table from the following text:\n{text}"}],
    response_model=Table,
)

# Access structured table data (the return value is already a Table instance)
table = response
for row in table.rows:
    print(f"Item: {row.item}, Quantity: {row.quantity}, Price: {row.price}")
Output
Item: Apples, Quantity: 10, Price: 0.5
Item: Bananas, Quantity: 5, Price: 0.3
Item: Cherries, Quantity: 20, Price: 1.5
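Once the table is a pydantic object, it can be handed to ordinary Python code. The following is a minimal sketch, not part of the original walkthrough, that writes the rows to CSV text with the standard library. The helper name table_to_csv is hypothetical, and the Table instance is built by hand here so the example runs without an API call.

```python
import csv
import io
from typing import List

from pydantic import BaseModel


class TableRow(BaseModel):
    item: str
    quantity: int
    price: float


class Table(BaseModel):
    rows: List[TableRow]


def table_to_csv(table: Table) -> str:
    """Serialize a Table to CSV text (hypothetical helper for illustration)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item", "quantity", "price"])
    for row in table.rows:
        writer.writerow([row.item, row.quantity, row.price])
    return buffer.getvalue()


# Hand-built stand-in for the object that the extraction call returns.
table = Table(
    rows=[
        TableRow(item="Apples", quantity=10, price=0.5),
        TableRow(item="Bananas", quantity=5, price=0.3),
        TableRow(item="Cherries", quantity=20, price=1.5),
    ]
)

csv_text = table_to_csv(table)
print(csv_text)
```

In a real run, the table variable from the extraction step could be passed to the helper directly.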
Common variations
For asynchronous use, wrap an AsyncOpenAI client with instructor.from_openai and await chat.completions.create. You can also switch to other OpenAI models such as gpt-4o for higher accuracy, or use Anthropic models by wrapping their client with instructor.from_anthropic. Streaming of structured output goes through Instructor's partial and iterable helpers (create_partial, create_iterable) rather than a plain response_model call.
import asyncio
import os
from typing import List

import instructor
from openai import AsyncOpenAI
from pydantic import BaseModel

class TableRow(BaseModel):
    item: str
    quantity: int
    price: float

class Table(BaseModel):
    rows: List[TableRow]

async def async_extract():
    # An async client is required; the wrapped create method is then awaitable
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    inst_client = instructor.from_openai(client)
    text = """
| Item | Quantity | Price |
|------------|----------|--------|
| Oranges | 15 | 0.7 |
| Grapes | 8 | 2.0 |
"""
    response = await inst_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Extract the table from the following text:\n{text}"}],
        response_model=Table,
    )
    for row in response.rows:
        print(f"Item: {row.item}, Quantity: {row.quantity}, Price: {row.price}")

# To run async example:
# asyncio.run(async_extract())
Output
Item: Oranges, Quantity: 15, Price: 0.7
Item: Grapes, Quantity: 8, Price: 2.0
Troubleshooting
- If the extracted data is incomplete or incorrect, ensure your pydantic model matches the table structure exactly.
- If you get validation errors, check that the AI output format matches your model fields and types.
- Use a more capable model like gpt-4o if extraction quality is poor.
- Verify your OPENAI_API_KEY is set correctly and has access to the model.
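To see what a validation error looks like without calling the API, the sketch below feeds hand-written row data into the same pydantic models. The validate_rows helper and both payloads are hypothetical stand-ins for model output. The example illustrates that numeric strings are coerced to the declared types, while a value such as "ten" is rejected with the path to the offending field.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class TableRow(BaseModel):
    item: str
    quantity: int
    price: float


class Table(BaseModel):
    rows: List[TableRow]


def validate_rows(raw_rows):
    """Return (table, problems); table is None when validation fails."""
    try:
        return Table(rows=raw_rows), []
    except ValidationError as exc:
        # Each error records the path to the failing field under "loc".
        problems = [".".join(str(part) for part in err["loc"]) for err in exc.errors()]
        return None, problems


# Hypothetical payloads standing in for what a model might return.
good = [{"item": "Apples", "quantity": "10", "price": "0.5"}]
bad = [{"item": "Apples", "quantity": "ten", "price": 0.5}]

good_table, good_problems = validate_rows(good)
print(good_table.rows[0].quantity, good_problems)

bad_table, bad_problems = validate_rows(bad)
print(bad_table, bad_problems)
```

The field paths reported for the failing payload point at the column whose type in the model does not match the data, which is usually the place to adjust either the schema or the prompt.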
Key Takeaways
- Define a precise pydantic model to represent your table schema for accurate extraction.
- Use instructor.from_openai to wrap the OpenAI client and enable structured extraction with response_model.
- Switch to more powerful models like gpt-4o for better table extraction accuracy.
- Async extraction is supported by wrapping AsyncOpenAI and awaiting chat.completions.create, which suits scalable applications.
- Validate your environment variables and model compatibility to avoid extraction errors.