How to extract structured data from text with AI
Quick answer
Use a modern AI chat model like gpt-4o-mini with a structured response schema: pass a Pydantic model to the SDK's structured-output parser, or request JSON explicitly in your prompt. The OpenAI Python SDK supports this via client.beta.chat.completions.parse with a Pydantic model, or via chat.completions.create with explicit JSON instructions, to reliably extract structured data from text.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.40"
- pip install pydantic
Setup
Install the openai Python SDK and pydantic for structured data validation. Set your OpenAI API key as an environment variable for secure authentication.
- Install packages: pip install openai pydantic
- Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)

```
pip install openai pydantic
```
output
```
Collecting openai
Collecting pydantic
Successfully installed openai-1.x.x pydantic-2.x.x
```
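Before making any API calls, it can help to confirm the key is actually visible to Python; a minimal check:

```python
import os

# Report whether the OPENAI_API_KEY environment variable is visible to Python
key_is_set = bool(os.environ.get("OPENAI_API_KEY"))
if key_is_set:
    print("OPENAI_API_KEY is set")
else:
    print("OPENAI_API_KEY is missing; export it before running the examples")
```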
Step by step
Define a pydantic.BaseModel to specify the structured data schema you want to extract. Use the OpenAI client.beta.chat.completions.parse method with response_format set to your model; the SDK parses the AI's JSON response into an instance of that model.
This example extracts a user's name and age from a text input.
```python
import os

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class User(BaseModel):
    name: str
    age: int

prompt = "Extract the user's name and age from this text: 'John is 30 years old.'"

# parse() validates the model's JSON output against the User schema
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format=User,
)
user = response.choices[0].message.parsed
print(f"Name: {user.name}, Age: {user.age}")
```
output
```
Name: John, Age: 30
```
Common variations
You can extract more complex nested data by defining nested Pydantic models. Alternatively, instruct the model to output JSON and parse it manually if you prefer not to use the structured-output parser; passing response_format={"type": "json_object"} enforces syntactically valid JSON. Other AI providers such as Anthropic support similar structured extraction patterns.
```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = (
    "Extract the user's name and age from this text and respond with JSON only:"
    " 'Alice is 25 years old.'"
)

# JSON mode guarantees the reply is syntactically valid JSON
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)
json_text = response.choices[0].message.content

# Parse the JSON manually
user_data = json.loads(json_text)
print(f"Name: {user_data['name']}, Age: {user_data['age']}")
```
output
```
Name: Alice, Age: 25
```
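The nested-model variation mentioned above works the same way: define one BaseModel inside another and the parsed reply nests accordingly. A local sketch of just the validation step (the sample dict stands in for a model reply; no API call is made):

```python
from pydantic import BaseModel

class Address(BaseModel):
    city: str
    country: str

class User(BaseModel):
    name: str
    age: int
    address: Address  # nested model: the reply's "address" object maps here

# Validate a dict shaped like a model reply against the nested schema
reply = {"name": "Ada", "age": 36, "address": {"city": "London", "country": "UK"}}
user = User.model_validate(reply)
print(user.address.city)
```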
Troubleshooting
- If the AI response is not valid JSON, make sure your prompt clearly instructs the model to respond with JSON only, or pass response_format={"type": "json_object"} to enforce it.
- If response_format parsing fails, check that your Pydantic schema matches the structure the model is asked to produce.
- Set max_tokens high enough that the JSON response is not cut off mid-object; a truncated reply will not parse.
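To see what a schema mismatch looks like in practice, here is a small local sketch: Pydantic raises ValidationError and names the offending field, which is usually enough to spot where prompt and schema disagree.

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int

# "age" arrives as prose instead of an integer, so validation fails
loc = None
try:
    User.model_validate({"name": "John", "age": "thirty"})
except ValidationError as exc:
    loc = exc.errors()[0]["loc"]  # names the field that failed
    print("schema mismatch at field:", loc)
```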
Key Takeaways
- Use pydantic models with OpenAI's structured-output parse method for reliable structured extraction.
- When not using the parser, explicitly instruct the AI to output JSON and parse it manually.
- Validate and match your schema to the expected AI output to avoid parsing errors.
- Set environment variables for API keys to keep credentials secure.
- Set max_tokens high enough to prevent incomplete responses.
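When parsing manually, replies sometimes arrive wrapped in a markdown code fence even when the prompt asks for JSON only. A small defensive helper can strip the fence first (parse_json_reply is a hypothetical name, not part of any SDK):

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Strip an optional ```json ... ``` fence, then parse the reply as JSON."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    return json.loads(cleaned)

# Works on both fenced and bare replies
print(parse_json_reply('```json\n{"name": "Alice", "age": 25}\n```'))
print(parse_json_reply('{"name": "Alice", "age": 25}'))
```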