How to build a data extraction pipeline with LLMs
Quick answer
Use a modern LLM such as gpt-4o to parse unstructured text into structured data by prompting it with clear extraction instructions and an output schema. Implement a pipeline in Python that sends text chunks to the LLM via the OpenAI SDK, then parses and stores the structured JSON output for downstream use.
Prerequisites
- Python 3.8+
- An OpenAI API key
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable for secure access.
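For example, in a POSIX shell (the key value below is a placeholder, not a real key):

```shell
# Store the key in the current shell session so the SDK can read it.
export OPENAI_API_KEY="sk-your-key-here"
```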
```shell
pip install "openai>=1.0"
```
(The quotes prevent the shell from interpreting `>` as output redirection.)
Step by step
This example shows how to extract structured fields from raw text using gpt-4o. The prompt instructs the model to output JSON with specific keys. The pipeline sends text to the model, receives JSON, and parses it for further processing.
```python
import os
import json

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample unstructured text to extract data from
raw_text = """
Invoice Number: 12345
Date: 2026-04-01
Total Amount: $1,234.56
Vendor: Acme Corp
"""

# Prompt template instructing the model to output JSON
prompt = f"""
Extract the following fields from the text below and output a JSON object with keys: invoice_number, date, total_amount, vendor.
Text:
{raw_text}
JSON:"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

json_text = response.choices[0].message.content.strip()

# Parse the JSON output safely; the model can occasionally return invalid JSON
try:
    extracted_data = json.loads(json_text)
except json.JSONDecodeError:
    extracted_data = None

print("Extracted data:", extracted_data)
```
Output
Extracted data: {'invoice_number': '12345', 'date': '2026-04-01', 'total_amount': '$1,234.56', 'vendor': 'Acme Corp'}
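Before storing the result for downstream use, you will usually want to validate and normalize the extracted fields. A minimal sketch, assuming the field names from the example above (the `normalize_invoice` helper is hypothetical, not part of any SDK):

```python
def normalize_invoice(data: dict) -> dict:
    """Check that required keys are present and convert total_amount to a float."""
    required = {"invoice_number", "date", "total_amount", "vendor"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Strip the currency symbol and thousands separators before converting.
    amount = float(data["total_amount"].replace("$", "").replace(",", ""))
    return {**data, "total_amount": amount}

record = normalize_invoice({
    "invoice_number": "12345",
    "date": "2026-04-01",
    "total_amount": "$1,234.56",
    "vendor": "Acme Corp",
})
print(record["total_amount"])  # 1234.56
```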
Common variations
You can adapt this pipeline by using asynchronous calls for higher throughput, switching to other models such as claude-3-5-sonnet-20241022 for a different cost or style trade-off, or adding streaming to process large documents incrementally.
```python
import asyncio
import os

from openai import AsyncOpenAI

# The async client exposes the same API as OpenAI, with awaitable methods
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def extract_data_async(text: str) -> str:
    prompt = f"Extract invoice_number, date, total_amount, vendor from the text below and output JSON.\nText:\n{text}\nJSON:"
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

async def main():
    raw_text = """Invoice Number: 67890
Date: 2026-04-02
Total Amount: $789.00
Vendor: Beta LLC"""
    json_output = await extract_data_async(raw_text)
    print("Async extracted JSON:", json_output)

asyncio.run(main())
```
Output
Async extracted JSON: {"invoice_number": "67890", "date": "2026-04-02", "total_amount": "$789.00", "vendor": "Beta LLC"}
Troubleshooting
- If the model output is not valid JSON, add explicit instructions to output only JSON; with gpt-4o you can also pass response_format={"type": "json_object"} to constrain the output, and validate the result against a schema before storing it.
- If you hit rate limits, implement exponential backoff or batch your requests.
- If extraction is inconsistent, refine your prompt with examples or use few-shot prompting.
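The exponential-backoff suggestion above can be sketched as a small retry wrapper. This is a generic shape, not an OpenAI-specific utility; in practice you would catch the SDK's rate-limit exception rather than a bare Exception:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn, retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Double the wait each attempt; jitter avoids synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage sketch: wrap the API call from the pipeline above, e.g.
# data = with_backoff(lambda: client.chat.completions.create(...))
```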
Key Takeaways
- Use clear, explicit prompts instructing the LLM to output structured JSON for reliable data extraction.
- Parse the LLM's JSON output safely in Python to integrate extracted data into your pipeline.
- Leverage async calls or streaming for scalability when processing large volumes of text.
- Refine prompts iteratively to improve extraction accuracy and handle edge cases.
- Monitor API usage and handle errors gracefully to maintain pipeline robustness.