How to extract structured data from text with AI
Quick answer
Use a modern AI chat model like gpt-4o-mini with a structured response schema: pass a Pydantic model to the SDK's structured-output parser, or request JSON explicitly in your prompt. The OpenAI Python SDK supports this via client.beta.chat.completions.parse with a Pydantic model, or via chat.completions.create with explicit JSON instructions, to reliably extract structured data from text.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.40"
- pip install pydantic
Setup
Install the openai Python SDK and pydantic for structured data validation. Set your OpenAI API key as an environment variable for secure authentication.
- Install packages: pip install openai pydantic
- Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)

```
pip install openai pydantic
```
output
```
Collecting openai
Collecting pydantic
Successfully installed openai-1.x.x pydantic-2.x.x
```
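Before making any API calls, it can help to confirm the key is actually visible to Python; a minimal check:

```python
import os

# Report whether the OPENAI_API_KEY environment variable is visible to Python
key_is_set = bool(os.environ.get("OPENAI_API_KEY"))
if key_is_set:
    print("OPENAI_API_KEY is set")
else:
    print("OPENAI_API_KEY is missing; export it before running the examples")
```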
Step by step
Define a pydantic.BaseModel to specify the structured data schema you want to extract. Use the OpenAI client.beta.chat.completions.parse method with response_format set to your model; the SDK parses the AI's JSON response into an instance of that model.
This example extracts a user's name and age from a text input.
```python
import os

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class User(BaseModel):
    name: str
    age: int

prompt = "Extract the user's name and age from this text: 'John is 30 years old.'"

# parse() validates the model's JSON output against the User schema
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format=User,
)
user = response.choices[0].message.parsed
print(f"Name: {user.name}, Age: {user.age}")
```
output
```
Name: John, Age: 30
```
Common variations
You can extract more complex nested data by defining nested Pydantic models. Alternatively, instruct the model to output JSON and parse it manually if you prefer not to use the structured-output parser; passing response_format={"type": "json_object"} enforces syntactically valid JSON. Other AI providers such as Anthropic support similar structured extraction patterns.
```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = (
    "Extract the user's name and age from this text and respond with JSON only:"
    " 'Alice is 25 years old.'"
)

# JSON mode guarantees the reply is syntactically valid JSON
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)
json_text = response.choices[0].message.content

# Parse the JSON manually
user_data = json.loads(json_text)
print(f"Name: {user_data['name']}, Age: {user_data['age']}")
```
output
```
Name: Alice, Age: 25
```
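The nested-model variation mentioned above works the same way: define one BaseModel inside another and the parsed reply nests accordingly. A local sketch of just the validation step (the sample dict stands in for a model reply; no API call is made):

```python
from pydantic import BaseModel

class Address(BaseModel):
    city: str
    country: str

class User(BaseModel):
    name: str
    age: int
    address: Address  # nested model: the reply's "address" object maps here

# Validate a dict shaped like a model reply against the nested schema
reply = {"name": "Ada", "age": 36, "address": {"city": "London", "country": "UK"}}
user = User.model_validate(reply)
print(user.address.city)
```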
Troubleshooting
- If the AI response is not valid JSON, make sure your prompt clearly instructs the model to respond with JSON only, or pass response_format={"type": "json_object"} to enforce it.
- If response_format parsing fails, check that your Pydantic schema matches the structure the model is asked to produce.
- Set max_tokens high enough that the JSON response is not cut off mid-object; a truncated reply will not parse.
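To see what a schema mismatch looks like in practice, here is a small local sketch: Pydantic raises ValidationError and names the offending field, which is usually enough to spot where prompt and schema disagree.

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int

# "age" arrives as prose instead of an integer, so validation fails
loc = None
try:
    User.model_validate({"name": "John", "age": "thirty"})
except ValidationError as exc:
    loc = exc.errors()[0]["loc"]  # names the field that failed
    print("schema mismatch at field:", loc)
```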
Key Takeaways
- Use pydantic models with OpenAI's structured-output parse method for reliable structured extraction.
- When not using the parser, explicitly instruct the AI to output JSON and parse it manually.
- Validate and match your schema to the expected AI output to avoid parsing errors.
- Set environment variables for API keys to keep credentials secure.
- Set max_tokens high enough to prevent incomplete responses.
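When parsing manually, replies sometimes arrive wrapped in a markdown code fence even when the prompt asks for JSON only. A small defensive helper can strip the fence first (parse_json_reply is a hypothetical name, not part of any SDK):

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Strip an optional ```json ... ``` fence, then parse the reply as JSON."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    return json.loads(cleaned)

# Works on both fenced and bare replies
print(parse_json_reply('```json\n{"name": "Alice", "age": 25}\n```'))
print(parse_json_reply('{"name": "Alice", "age": 25}'))
```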