How to · Intermediate · 4 min read

How to build a data extraction pipeline with LLMs

Quick answer
Use a modern LLM like gpt-4o to parse unstructured text into structured data by prompting it with clear extraction instructions and output schemas. Implement a pipeline in Python that sends text chunks to the LLM via the OpenAI SDK, then parses and stores the structured JSON output for downstream use.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the version specifier so the shell does not treat > as a redirect)

Setup

Install the openai Python SDK and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"

Step by step

This example shows how to extract structured fields from raw text using gpt-4o. The prompt instructs the model to output JSON with specific keys. The pipeline sends text to the model, receives JSON, and parses it for further processing.

python
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample unstructured text to extract data from
raw_text = """
Invoice Number: 12345
Date: 2026-04-01
Total Amount: $1,234.56
Vendor: Acme Corp
"""

# Prompt template instructing the model to output JSON
prompt = f"""
Extract the following fields from the text below and output a JSON object with keys: invoice_number, date, total_amount, vendor.
Text:\n{raw_text}

JSON:"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

json_text = response.choices[0].message.content.strip()

# Parse the JSON output safely
try:
    extracted_data = json.loads(json_text)
except json.JSONDecodeError:
    extracted_data = None

print("Extracted data:", extracted_data)
output
Extracted data: {'invoice_number': '12345', 'date': '2026-04-01', 'total_amount': '$1,234.56', 'vendor': 'Acme Corp'}
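Even with explicit instructions, models sometimes wrap their JSON in markdown code fences. A small helper (a sketch, not part of the OpenAI SDK) can strip those fences before parsing so the pipeline handles both raw and fenced output:

```python
import json
import re

def parse_llm_json(text: str):
    """Strip optional ```json fences, then parse the remaining text as JSON."""
    cleaned = text.strip()
    # Remove a leading ``` or ```json fence and a trailing ``` fence, if present
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None

# Works on both fenced and raw JSON
fenced = '```json\n{"invoice_number": "12345"}\n```'
print(parse_llm_json(fenced))                    # {'invoice_number': '12345'}
print(parse_llm_json('{"vendor": "Acme Corp"}')) # {'vendor': 'Acme Corp'}
```

Dropping this in place of the bare `json.loads` call above makes the parse step tolerant of the most common formatting variation.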

Common variations

You can adapt this pipeline by using asynchronous calls for higher throughput, switching to other models such as claude-3-5-sonnet-20241022 (via the Anthropic SDK rather than the OpenAI SDK) for different cost or quality trade-offs, or splitting large documents into chunks and processing them incrementally.

python
import asyncio
import os
from openai import AsyncOpenAI

# Use the async client; the sync OpenAI client has no awaitable methods
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def extract_data_async(text: str):
    prompt = f"Extract invoice_number, date, total_amount, vendor from the text below and output JSON.\nText:\n{text}\nJSON:"
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

async def main():
    raw_text = '''Invoice Number: 67890\nDate: 2026-04-02\nTotal Amount: $789.00\nVendor: Beta LLC''' 
    json_output = await extract_data_async(raw_text)
    print("Async extracted JSON:", json_output)

asyncio.run(main())
output
Async extracted JSON: {"invoice_number": "67890", "date": "2026-04-02", "total_amount": "$789.00", "vendor": "Beta LLC"}
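For documents too long to send in one request, split the text into overlapping chunks and run the extraction on each chunk. A minimal sketch (the chunk size and overlap values below are illustrative, not tuned):

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200):
    """Split text into overlapping chunks so a field value that straddles a
    chunk boundary still appears intact in at least one chunk."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be greater than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

doc = "x" * 5000
pieces = chunk_text(doc)
print(len(pieces))          # 3
print(len(pieces[0]))       # 2000
```

Each chunk can then be passed to `extract_data_async`, and the per-chunk results merged or deduplicated downstream.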

Troubleshooting

  • If the model output is not valid JSON, add explicit instructions to output only JSON and consider using a JSON schema validator.
  • If you hit rate limits, implement exponential backoff or batch your requests.
  • For inconsistent extraction, refine your prompt with examples or use few-shot prompting.
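The backoff suggestion above can be sketched as a small retry wrapper (a generic sketch; in production you might instead rely on the SDK's built-in retries or a library such as tenacity):

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the original error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo with a stand-in that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # ok
```

Wrapping the API call in `with_backoff(lambda: client.chat.completions.create(...))` applies the same pattern to the pipeline above.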

Key Takeaways

  • Use clear, explicit prompts instructing the LLM to output structured JSON for reliable data extraction.
  • Parse the LLM's JSON output safely in Python to integrate extracted data into your pipeline.
  • Leverage async calls or streaming for scalability when processing large volumes of text.
  • Refine prompts iteratively to improve extraction accuracy and handle edge cases.
  • Monitor API usage and handle errors gracefully to maintain pipeline robustness.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022