How to extract data with OpenAI structured outputs
Quick answer
Use the OpenAI API's chat.completions.create method with a prompt that instructs the model to respond in a structured JSON format. Then parse response.choices[0].message.content as JSON to extract the data fields programmatically.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the official openai Python package and set your API key as an environment variable.
- Install package: pip install openai
- Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
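Before making any request, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch; get_api_key is a hypothetical helper name used for illustration and is not part of the openai package.

```python
import os


def get_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error.

    get_api_key is a hypothetical helper for illustration only.
    """
    # Allow a custom mapping to be passed in, which makes the check easy to test.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in your shell before running."
        )
    return key
```

On Windows, a value written with setx only becomes visible in terminals opened afterwards, so a check like this can fail in the same window where setx was run.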
Step by step
This example shows how to prompt gpt-4o to return a JSON object with extracted fields, then parse it in Python.
import os
import json
from openai import OpenAI

# Create the client with the key read from the environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Tell the model exactly which keys to return
prompt = (
    "Extract the user's name and age from the following text and return as JSON:\n"
    "Text: 'John is 30 years old.'\n"
    "Respond ONLY with a JSON object with keys 'name' and 'age'."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# The reply text is expected to be a JSON string
content = response.choices[0].message.content

try:
    data = json.loads(content)
    print(f"Name: {data['name']}")
    print(f"Age: {data['age']}")
except json.JSONDecodeError:
    print("Failed to parse JSON:", content)
output
Name: John
Age: 30
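A prompt alone does not guarantee that the reply is valid JSON. The Chat Completions API also accepts a response_format parameter, and passing {"type": "json_object"} asks the model to return a syntactically valid JSON object (the prompt still has to mention JSON). The sketch below only builds the request arguments so it can run offline; build_extraction_request is a hypothetical helper name, not part of the openai package.

```python
def build_extraction_request(text, model="gpt-4o"):
    """Build keyword arguments for client.chat.completions.create.

    build_extraction_request is a hypothetical helper for illustration.
    """
    prompt = (
        "Extract the user's name and age from the following text and return as JSON:\n"
        f"Text: {text!r}\n"
        "Respond ONLY with a JSON object with keys 'name' and 'age'."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # JSON mode: asks for a syntactically valid JSON object.
        # It does not enforce which keys are present.
        "response_format": {"type": "json_object"},
    }


request = build_extraction_request("John is 30 years old.")
print(request["response_format"])
```

The resulting dictionary would be passed as client.chat.completions.create(**request). Because JSON mode does not check the keys, the parsed result still needs to be checked for 'name' and 'age'.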
Common variations
You can use different models like gpt-4o-mini for faster, cheaper extraction or claude-3-5-sonnet-20241022 with the Anthropic SDK. Async calls and streaming are also supported but less common for structured extraction.
import os
import json
import asyncio
from openai import AsyncOpenAI

async def async_extract():
    # AsyncOpenAI returns awaitable calls; the synchronous OpenAI client cannot be awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = (
        "Extract the user's name and age from the following text and return as JSON:\n"
        "Text: 'Alice is 25 years old.'\n"
        "Respond ONLY with a JSON object with keys 'name' and 'age'."
    )
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    content = response.choices[0].message.content
    data = json.loads(content)
    print(f"Name: {data['name']}")
    print(f"Age: {data['age']}")

asyncio.run(async_extract())
output
Name: Alice
Age: 25
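Async calls tend to pay off when several texts are extracted at once, since the requests can overlap. The sketch below shows one way to do that with asyncio.gather and a concurrency limit. The names extract_many, extract_one, max_concurrency and fake_extract are all hypothetical, and fake_extract stands in for a real API call so the example runs without a network connection.

```python
import asyncio


async def extract_many(texts, extract_one, max_concurrency=5):
    """Run extract_one over texts concurrently, keeping results in input order.

    extract_one is any coroutine function that takes a text and returns a dict.
    All names here are hypothetical and chosen for illustration.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(text):
        # Limit how many requests are in flight at the same time.
        async with semaphore:
            return await extract_one(text)

    return await asyncio.gather(*(guarded(t) for t in texts))


async def fake_extract(text):
    # Stand-in for a real API call so the sketch runs offline.
    await asyncio.sleep(0)
    name, _, rest = text.partition(" is ")
    return {"name": name, "age": int(rest.split()[0])}


texts = ["Alice is 25 years old.", "Bob is 41 years old."]
results = asyncio.run(extract_many(texts, fake_extract))
print(results)
# [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 41}]
```

In real use, extract_one would wrap an awaited chat.completions.create call on an AsyncOpenAI client, and max_concurrency would be chosen to stay within the account's rate limits.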
Troubleshooting
- If JSON parsing fails, check the model's output for extra text or formatting and refine your prompt to instruct the model to respond with ONLY JSON.
- Use try-except blocks to handle malformed JSON gracefully.
- For complex extraction, consider using response_model with the instructor library for schema validation.
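The points above can be combined into one parsing step. The sketch below strips a surrounding Markdown code fence (a common form of "extra formatting"), parses the JSON, and checks the two expected fields. parse_extraction is a hypothetical helper built only on the standard library; it stands in for schema validation and is not the instructor library's API.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_extraction(content):
    """Parse a model reply into {'name': str, 'age': int} or raise ValueError.

    parse_extraction is a hypothetical helper for illustration only.
    """
    text = content.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        # and the closing fence line, if present.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {content!r}") from exc
    if not isinstance(data, dict):
        raise ValueError("Reply is not a JSON object")
    name, age = data.get("name"), data.get("age")
    if not isinstance(name, str) or not name:
        raise ValueError("Missing or non-string 'name'")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(age, bool) or not isinstance(age, int):
        raise ValueError("Missing or non-integer 'age'")
    return {"name": name, "age": age}


reply = FENCE + 'json\n{"name": "John", "age": 30}\n' + FENCE
print(parse_extraction(reply))
# {'name': 'John', 'age': 30}
```

Raising a single exception type keeps the calling code simple: one except ValueError clause covers both malformed JSON and missing or mistyped fields.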
Key Takeaways
- Use explicit prompts instructing the model to respond with JSON only for reliable structured output.
- Parse the response.choices[0].message.content as JSON to extract data fields programmatically.
- Handle JSON parsing errors gracefully with try-except to avoid runtime crashes.
- Async and smaller models like gpt-4o-mini can speed up extraction with lower cost.
- For strict schema enforcement, combine OpenAI with libraries like instructor for typed extraction.