How to extract structured data from unstructured text with AI
Quick answer
Use a large language model such as gpt-4o to parse unstructured text: prompt it to identify the key data fields and return them as structured JSON or CSV. Clear instructions and examples in the prompt guide the model's output format.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.
pip install "openai>=1.0"
Step by step
This example shows how to extract structured data such as names, dates, and amounts from unstructured text by prompting gpt-4o to output JSON.
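import os
from openai import OpenAI
# The client authenticates with the key stored in the OPENAI_API_KEY environment variable.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
unstructured_text = '''\nInvoice from Acme Corp\nDate: March 15, 2026\nTotal: $1,234.56\nCustomer: John Doe\n'''
# Name the exact keys wanted and end with "JSON:" to cue the output format.
prompt = f"Extract the invoice data as JSON with keys: company, date, total, customer.\nText:\n{unstructured_text}\nJSON:"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
# The reply is a plain string; parse it (for example with json.loads) before using it as data.
structured_data = response.choices[0].message.content
print(structured_data)
Output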
{
"company": "Acme Corp",
"date": "March 15, 2026",
"total": "$1,234.56",
"customer": "John Doe"
}
Common variations
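One variation keeps gpt-4o but turns on the Chat Completions JSON mode, which constrains the reply to a syntactically valid JSON object (barring truncation by a token limit). The sketch below is a minimal illustration, not a definitive implementation: `build_json_mode_request` and `extract_invoice` are hypothetical helper names, and the system message wording is an assumption. JSON mode expects the word "JSON" to appear somewhere in the messages, which the system message takes care of here.

```python
import json

INVOICE_KEYS = ["company", "date", "total", "customer"]


def build_json_mode_request(text, keys=INVOICE_KEYS, model="gpt-4o"):
    """Return keyword arguments for client.chat.completions.create (hypothetical helper)."""
    system = (
        "You extract data from text and reply with a single JSON object "
        "with exactly these keys: " + ", ".join(keys) + "."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": text},
        ],
        # JSON mode: ask the API to return a JSON object rather than free text.
        "response_format": {"type": "json_object"},
    }


def extract_invoice(client, text):
    """Send the request with an OpenAI client and parse the reply into a dict."""
    response = client.chat.completions.create(**build_json_mode_request(text))
    return json.loads(response.choices[0].message.content)
```

With the client from the step-by-step example, `extract_invoice(client, unstructured_text)` would return a Python dict instead of a string. JSON mode only guarantees well-formed JSON; it does not check that every requested key is present.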
You can use other models such as claude-3-5-haiku-20241022 or gemini-1.5-pro for extraction; the example below uses the Anthropic Python SDK (pip install anthropic). Async calls and streaming responses are also supported for large texts or real-time processing.
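import os
import anthropic
# The client authenticates with the key stored in the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
unstructured_text = '''\nInvoice from Acme Corp\nDate: March 15, 2026\nTotal: $1,234.56\nCustomer: John Doe\n'''
# The system prompt carries the extraction instructions; the user turn carries the text.
system_prompt = "You extract invoice data as JSON with keys: company, date, total, customer."
user_prompt = f"Text:\n{unstructured_text}\nJSON:"
message = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=256,  # required by the Messages API; caps the length of the reply
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}],
)
# The reply is a list of content blocks; the first block holds the text.
print(message.content[0].text)
Output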
{
"company": "Acme Corp",
"date": "March 15, 2026",
"total": "$1,234.56",
"customer": "John Doe"
}
Troubleshooting
- If the output is not valid JSON, add explicit instructions in the prompt to "only output JSON" and consider using a JSON schema or examples.
- If the model misses fields, provide more detailed examples or increase the prompt context.
- For very long texts, chunk the input and aggregate results.
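The chunking advice in the last bullet could be sketched as follows. The code splits the text at line boundaries under a character budget, runs a caller-supplied extraction function on each chunk, and keeps the first non-empty value seen for each key. All helper names, the 2,000-character default, and the first-value-wins merge rule are assumptions for illustration; `extract_fn` stands in for whichever model call you use.

```python
def chunk_lines(text, max_chars=2000):
    """Split text into chunks of about max_chars, breaking only at line ends.

    A single line longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the budget.
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def merge_results(results):
    """Merge per-chunk dicts, keeping the first non-empty value for each key."""
    merged = {}
    for result in results:
        for key, value in result.items():
            if value not in (None, "") and key not in merged:
                merged[key] = value
    return merged


def extract_from_long_text(text, extract_fn, max_chars=2000):
    """Run extract_fn (chunk -> dict) on every chunk and merge the results."""
    return merge_results(extract_fn(chunk) for chunk in chunk_lines(text, max_chars))
```

Splitting at line ends keeps a label such as "Total:" together with its value. Documents whose fields span several lines may need overlapping chunks or a different merge rule.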
Key Takeaways
- Use clear, explicit prompts instructing the model to output structured JSON for reliable extraction.
- Choose models like gpt-4o or claude-3-5-haiku-20241022 for best accuracy in data extraction tasks.
- Chunk large unstructured texts to maintain context and improve extraction quality.
- Validate and parse the model output programmatically to handle any formatting inconsistencies.
- Provide examples in prompts to guide the model toward the desired structured format.
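Validating and parsing the model output programmatically might look like the sketch below. It tolerates a reply wrapped in a Markdown code fence or surrounded by extra prose, checks for the four invoice keys used in this article, and normalizes the date and total. The function names are hypothetical, and the date format ("March 15, 2026") and currency format ("$1,234.56") are assumptions taken from the sample output above; other inputs would need different parsing.

```python
import json
import re
from datetime import datetime
from decimal import Decimal

REQUIRED_KEYS = ("company", "date", "total", "customer")


def parse_model_json(raw, required=REQUIRED_KEYS):
    """Parse a model reply into a dict and verify that the expected keys exist."""
    # Keep only the span from the first "{" to the last "}", which drops
    # code fences or commentary around the JSON object.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))  # raises json.JSONDecodeError if malformed
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def normalize_invoice(data):
    """Return a copy with an ISO 8601 date and a Decimal total."""
    normalized = dict(data)
    # Assumes an English "Month D, YYYY" date, as in the sample output.
    normalized["date"] = datetime.strptime(data["date"], "%B %d, %Y").date().isoformat()
    # Assumes a non-negative amount; strips the currency symbol and separators.
    normalized["total"] = Decimal(re.sub(r"[^\d.]", "", data["total"]))
    return normalized
```

Because both failure modes raise `ValueError` (`json.JSONDecodeError` is a subclass), a caller can catch one exception type and retry the request with a stricter prompt.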