How to extract structured data from unstructured text with an LLM
Quick answer
Use a large language model (LLM) like gpt-4o to parse unstructured text by prompting it to output JSON or another structured format. Send the text with a clear instruction in the messages parameter, then parse the model's JSON response to extract the structured data.
PREREQUISITES
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable.
- Run pip install openai to install the SDK.
- Set your API key in your shell: export OPENAI_API_KEY='your_api_key_here' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key_here" (Windows; open a new terminal afterwards, because setx only affects new sessions).
pip install openai
Step by step
This example shows how to send unstructured text to the gpt-4o model with a prompt instructing it to extract structured JSON data. The response is parsed to get the structured output.
import json
import os

from openai import OpenAI

# Read the API key from the environment rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

unstructured_text = """
John Doe, born on 1990-05-15, lives at 123 Elm St, Springfield. His email is john.doe@example.com and phone number is (555) 123-4567.
"""

# Name the fields explicitly so the model returns predictable keys.
prompt = f"Extract the following fields as JSON: name, birthdate, address, email, phone from the text below.\nText:\n{unstructured_text}\nJSON:"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# The reply is plain text and may not be valid JSON, so parse it defensively.
json_text = response.choices[0].message.content.strip()
try:
    structured_data = json.loads(json_text)
except json.JSONDecodeError:
    structured_data = None

print("Extracted structured data:", structured_data)
output
Extracted structured data: {'name': 'John Doe', 'birthdate': '1990-05-15', 'address': '123 Elm St, Springfield', 'email': 'john.doe@example.com', 'phone': '(555) 123-4567'}
Common variations
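One variation on the OpenAI example is to request JSON mode, so the reply is less likely to arrive wrapped in prose or a Markdown fence. The sketch below is a minimal illustration, not a definitive implementation: build_request and extract are hypothetical helper names, the temperature setting and the system-message wording are assumptions, and the OpenAI import sits inside extract only so that the request can be inspected without the SDK installed.

```python
import json
import os


def build_request(text, fields):
    """Build keyword arguments for client.chat.completions.create (hypothetical helper)."""
    instruction = (
        "Extract the following fields as a JSON object: "
        + ", ".join(fields)
        + ". Use null for any field that is missing."
    )
    return {
        "model": "gpt-4o",
        # JSON mode asks the API for a syntactically valid JSON object.
        # The messages must still mention JSON explicitly.
        "response_format": {"type": "json_object"},
        # Assumption: a low temperature keeps extraction output more repeatable.
        "temperature": 0,
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    }


def extract(text, fields):
    """Send the request and parse the reply; needs OPENAI_API_KEY and network access."""
    from openai import OpenAI  # imported lazily so build_request works without the SDK

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(**build_request(text, fields))
    return json.loads(response.choices[0].message.content)
```

Calling extract(unstructured_text, ["name", "birthdate", "address", "email", "phone"]) would then return a dict. JSON mode only guarantees valid syntax, not that every requested field is present, so checking the keys afterwards is still worthwhile.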
You can use other models, such as Anthropic's claude-3-5-sonnet-20241022, with similar prompt engineering. Async calls and streaming responses are also possible with the respective SDKs. For more complex extraction, chain multiple prompts or use LangChain for orchestration.
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

unstructured_text = "John Doe, born on 1990-05-15, lives at 123 Elm St, Springfield. His email is john.doe@example.com and phone number is (555) 123-4567."

# The system prompt sets the model's role; the user message carries the task.
system_prompt = "You extract structured JSON data from unstructured text."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,  # required by the Messages API; caps the reply length
    system=system_prompt,
    messages=[{"role": "user", "content": f"Extract JSON with fields name, birthdate, address, email, phone from this text:\n{unstructured_text}"}],
)

# The reply is a list of content blocks; the first one holds the text.
print("Extracted JSON:", response.content[0].text)
output
Extracted JSON: {"name": "John Doe", "birthdate": "1990-05-15", "address": "123 Elm St, Springfield", "email": "john.doe@example.com", "phone": "(555) 123-4567"}
Troubleshooting
- If the model returns malformed JSON, try adding explicit instructions like "Respond only with valid JSON" or use a JSON schema validation step.
- If the output is incomplete, increase max_tokens or simplify the prompt.
- For inconsistent field names, normalize keys in post-processing.
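The first two points above can be handled defensively in code. The following is a minimal sketch, assuming the reply contains at most one JSON object, possibly surrounded by prose or a Markdown fence. parse_model_json, missing_fields and REQUIRED_FIELDS are hypothetical names, and the required-field check is a simplified stand-in for a full JSON schema validation step.

```python
import json

# Assumption: the same five fields requested in the examples above.
REQUIRED_FIELDS = ("name", "birthdate", "address", "email", "phone")


def parse_model_json(raw):
    """Return the JSON object embedded in a model reply, or None if there is none."""
    # Take the outermost {...} span so surrounding prose or fences are ignored.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None


def missing_fields(data, required=REQUIRED_FIELDS):
    """List required keys that are absent, null, or empty in the parsed object."""
    return [key for key in required if data.get(key) in (None, "")]
```

If parse_model_json returns None, or missing_fields returns a non-empty list, one option is to retry the request with a stricter instruction or a larger max_tokens value.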
Key Takeaways
- Use clear, explicit prompts instructing the LLM to output JSON for reliable structured extraction.
- Parse the LLM's JSON response safely with error handling to avoid crashes on malformed output.
- Anthropic and OpenAI both support structured data extraction with similar prompt patterns.
- Adjust max_tokens and prompt detail to improve extraction completeness and accuracy.
- Post-process and normalize extracted data to handle variations in model output formatting.
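As an illustration of the post-processing step, the sketch below maps key variants such as "birthDate" or "Phone Number" onto the field names used in this article. The ALIASES table and the helper names canonical_key and normalize_record are hypothetical; a real alias list would depend on the variants your model actually produces.

```python
import re

# Hypothetical alias table: variants a model might emit, mapped to canonical names.
ALIASES = {
    "full_name": "name",
    "date_of_birth": "birthdate",
    "birth_date": "birthdate",
    "dob": "birthdate",
    "e_mail": "email",
    "phone_number": "phone",
}


def canonical_key(key):
    """Lowercase a key, split camelCase, and collapse separators to underscores."""
    key = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key.strip())
    key = re.sub(r"[^a-z0-9]+", "_", key.lower()).strip("_")
    return ALIASES.get(key, key)


def normalize_record(record):
    """Return a copy of record with canonical keys and whitespace-trimmed strings."""
    return {
        canonical_key(k): v.strip() if isinstance(v, str) else v
        for k, v in record.items()
    }
```

For example, normalize_record({"birthDate": "1990-05-15", "E-mail": "john.doe@example.com"}) yields a dict with the keys birthdate and email.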