How to extract structured data from unstructured text with an LLM

Quick answer

Prompt a large language model (LLM) such as gpt-4o to return the fields you need as JSON. Send the unstructured text with a clear instruction in the messages parameter, then parse the model's JSON reply into a native data structure.

Prerequisites

Python 3.8+
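OpenAI API key (API usage is billed separately from ChatGPT; confirm your account can call gpt-4o)
pip install "openai>=1.0" (the quotes stop the shell from treating > as a redirect)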

Setup

Install the openai Python SDK and set your API key as an environment variable.

Run pip install openai to install the SDK.
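Set your API key in your shell: export OPENAI_API_KEY='your_api_key_here' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key_here" (Windows). Note that setx only affects terminals opened afterwards, so start a new session before running the examples.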

bash

pip install openai
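
Before calling the API, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch; require_env is a hypothetical helper name introduced here for illustration, not part of the openai SDK.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration; it is not part of any SDK.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the examples.")
    return value
```

Calling require_env("OPENAI_API_KEY") at the top of a script fails early with a readable message instead of a KeyError or an authentication error later on.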

Step by step

This example shows how to send unstructured text to the gpt-4o model with a prompt instructing it to extract structured JSON data. The response is parsed to get the structured output.

python

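import json
import os

from openai import OpenAI

# Read the API key from the environment rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

unstructured_text = """
John Doe, born on 1990-05-15, lives at 123 Elm St, Springfield. His email is john.doe@example.com and phone number is (555) 123-4567.
"""

# Name the fields explicitly. JSON mode requires the word "JSON" to appear in the messages.
prompt = (
    "Extract the following fields as JSON: name, birthdate, address, email, phone "
    f"from the text below.\nText:\n{unstructured_text}\nJSON:"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # JSON mode: reply with a single JSON object
    temperature=0,  # reduce run-to-run variation
)

# content can be None (for example on a refusal), so fall back to an empty string.
json_text = (response.choices[0].message.content or "").strip()

try:
    structured_data = json.loads(json_text)
except json.JSONDecodeError:
    # Fall back to None instead of crashing on malformed output.
    structured_data = None

print("Extracted structured data:", structured_data)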

output

Extracted structured data: {'name': 'John Doe', 'birthdate': '1990-05-15', 'address': '123 Elm St, Springfield', 'email': 'john.doe@example.com', 'phone': '(555) 123-4567'}
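
Valid JSON is not the same as complete JSON: the model may omit a field or leave it empty. As a minimal sketch, assuming the five field names from the prompt above, a small check can report what is missing before the record is used. The name parse_and_check is hypothetical and introduced only for illustration.

```python
import json

# Field names taken from the prompt in the example above.
REQUIRED_FIELDS = ("name", "birthdate", "address", "email", "phone")


def parse_and_check(json_text, required=REQUIRED_FIELDS):
    """Parse a model reply and report which expected fields are missing.

    Returns (data, missing). data is None when the reply is not a JSON object.
    """
    try:
        data = json.loads(json_text)
    except json.JSONDecodeError:
        return None, list(required)
    if not isinstance(data, dict):
        return None, list(required)
    # Treat absent, null, and empty-string values as missing.
    missing = [field for field in required if data.get(field) in (None, "")]
    return data, missing


reply = '{"name": "John Doe", "birthdate": "1990-05-15", "email": "john.doe@example.com"}'
data, missing = parse_and_check(reply)
print(missing)  # ['address', 'phone']
```

A non-empty missing list is a reasonable trigger for a retry or for routing the record to manual review.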

Common variations

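You can use other models, such as Anthropic's claude-3-5-sonnet-20241022, with similar prompts. Both the OpenAI and Anthropic SDKs also support async calls and streaming responses. For more complex extraction, chain multiple prompts or use an orchestration library such as LangChain. The example below uses the Anthropic SDK (pip install anthropic) and reads the key from the ANTHROPIC_API_KEY environment variable.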

python

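import os

from anthropic import Anthropic

# Read the API key from the environment rather than hard-coding it.
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

unstructured_text = "John Doe, born on 1990-05-15, lives at 123 Elm St, Springfield. His email is john.doe@example.com and phone number is (555) 123-4567."

# The system prompt sets the task; asking for JSON only keeps extra prose out of the reply.
system_prompt = "You extract structured JSON data from unstructured text. Respond only with valid JSON."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,  # upper bound on reply length; raise it if the JSON is cut off
    system=system_prompt,
    messages=[{"role": "user", "content": f"Extract JSON with fields name, birthdate, address, email, phone from this text:\n{unstructured_text}"}]
)

# The reply is a list of content blocks; the first block holds the generated text.
print("Extracted JSON:", response.content[0].text)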

output

Extracted JSON: {"name": "John Doe", "birthdate": "1990-05-15", "address": "123 Elm St, Springfield", "email": "john.doe@example.com", "phone": "(555) 123-4567"}
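
The Anthropic example prints the raw reply text, which is still a string. Models sometimes wrap the JSON in a sentence or a code fence, so json.loads on the whole reply can fail. The sketch below scans the reply for the first parseable JSON object using the standard library's JSONDecoder.raw_decode; extract_json_object is a hypothetical helper name, and the sample reply is made up for illustration.

```python
import json


def extract_json_object(reply_text):
    """Return the first JSON object found in reply_text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = reply_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(reply_text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not a valid object here; try the next opening brace.
        start = reply_text.find("{", start + 1)
    return None


# Made-up reply with prose around the JSON payload.
reply = 'Here is the extracted data:\n{"name": "John Doe", "phone": "(555) 123-4567"}\nLet me know if you need more.'
print(extract_json_object(reply))  # {'name': 'John Doe', 'phone': '(555) 123-4567'}
```

This is a fallback, not a substitute for a strict prompt: if the reply contains several objects, only the first one is returned.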

Troubleshooting

If the model returns malformed JSON, try adding explicit instructions like "Respond only with valid JSON" or use a JSON schema validation step.
If the output is incomplete, increase max_tokens or simplify the prompt.
For inconsistent field names, normalize keys in post-processing.
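
Key normalization, the last item above, can be sketched as follows. The alias table is an assumption made for illustration: it lists variants a model might plausibly emit, mapped to the field names used in this article, and should be adapted to your own schema.

```python
import re

# Hypothetical alias table: key variants mapped to the canonical field names.
ALIASES = {
    "full_name": "name",
    "date_of_birth": "birthdate",
    "birth_date": "birthdate",
    "dob": "birthdate",
    "phone_number": "phone",
    "email_address": "email",
}


def normalize_keys(record):
    """Convert keys to snake_case, then map known aliases to canonical names."""
    normalized = {}
    for key, value in record.items():
        # Split camelCase: insert "_" between a lowercase letter or digit and an uppercase letter.
        snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key.strip())
        # Collapse spaces and hyphens into underscores, then lower-case.
        snake = re.sub(r"[\s\-]+", "_", snake).lower()
        normalized[ALIASES.get(snake, snake)] = value
    return normalized


raw = {"Full Name": "John Doe", "dateOfBirth": "1990-05-15", "Phone-Number": "(555) 123-4567"}
print(normalize_keys(raw))
# {'name': 'John Doe', 'birthdate': '1990-05-15', 'phone': '(555) 123-4567'}
```

If two input keys normalize to the same name, the later one overwrites the earlier one, so log or reject such records if that matters for your data.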

Key Takeaways

Use clear, explicit prompts instructing the LLM to output JSON for reliable structured extraction.
Parse the LLM's JSON response safely with error handling to avoid crashes on malformed output.
Anthropic and OpenAI both support structured data extraction with similar prompt patterns.
Adjust max_tokens and prompt detail to improve extraction completeness and accuracy.
Post-process and normalize extracted data to handle variations in model output formatting.

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022
