How to extract invoice data with an LLM

Quick answer

Use a structured prompt with a chat.completions.create call to an LLM such as gpt-4o to extract invoice fields. Put the invoice text and an explicit list of the fields you want in the prompt, then parse the JSON response to read fields such as the invoice number, date, and total amount.

PREREQUISITES

Python 3.8+
OpenAI API key (free tier works)
pip install "openai>=1.0" (quote the requirement so the shell does not treat > as a redirect)

Setup

Install the official openai Python SDK and set your API key as an environment variable.

Install SDK: pip install openai
Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows; open a new terminal afterwards, because setx does not change the current session)

bash

pip install openai

output

Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
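Before calling the API, it can help to fail early with a readable message when the key is missing. The helper below is a minimal sketch; the function name require_api_key is hypothetical and not part of the openai SDK.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key found in `env`, or raise a readable error if it is missing.

    `require_api_key` is an illustrative helper, not an SDK function.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the examples.")
    return key
```

Calling require_api_key() at the top of a script surfaces a missing variable immediately, instead of as a KeyError or an authentication error later on.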

Step by step

This example extracts key invoice fields by sending the invoice text to gpt-4o with a prompt that requests JSON output, then parses the response into a Python dictionary.

python

import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

invoice_text = '''
Invoice Number: INV-12345
Date: 2026-03-15
Bill To: Acme Corp
Total Amount: $1,234.56
'''

prompt = f"""Extract the following fields from the invoice text below as JSON with keys: invoice_number, date, bill_to, total_amount.

Invoice text:\n{invoice_text}\n
JSON:"""

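response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    # JSON mode: the reply is a single JSON object rather than prose or a fenced block.
    # It requires the word "JSON" to appear in the messages, which the prompt satisfies.
    response_format={"type": "json_object"},
)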

extracted_json = response.choices[0].message.content

try:
    invoice_data = json.loads(extracted_json)
except json.JSONDecodeError:
    invoice_data = {}

print("Extracted invoice data:", invoice_data)

output

Extracted invoice data: {'invoice_number': 'INV-12345', 'date': '2026-03-15', 'bill_to': 'Acme Corp', 'total_amount': '$1,234.56'}
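The extracted values arrive as strings, so downstream code usually needs typed values. The sketch below converts the amount and date from the example output above. The helper name normalize_invoice is hypothetical, and it assumes US-style amounts ("$1,234.56") and ISO dates (YYYY-MM-DD); other formats would need their own handling.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def normalize_invoice(raw: dict) -> dict:
    """Return a copy of `raw` with total_amount as Decimal and date as datetime.date.

    Illustrative helper: fields that cannot be parsed are set to None.
    """
    cleaned = dict(raw)

    # Strip the currency symbol and thousands separators: "$1,234.56" -> "1234.56"
    amount_text = str(raw.get("total_amount", "")).replace("$", "").replace(",", "").strip()
    try:
        cleaned["total_amount"] = Decimal(amount_text)
    except InvalidOperation:
        cleaned["total_amount"] = None

    # Assumes an ISO 8601 calendar date such as "2026-03-15"
    try:
        cleaned["date"] = date.fromisoformat(str(raw.get("date", "")))
    except ValueError:
        cleaned["date"] = None

    return cleaned
```

Decimal is used instead of float so that currency amounts are not subject to binary rounding error.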

Common variations

For batch processing, you can issue requests asynchronously with asyncio and the SDK's AsyncOpenAI client so that several invoices are in flight at once. Streaming returns the response token by token, which lowers the time to first output but does not raise the input size limit. A smaller model such as gpt-4o-mini reduces cost, usually at some loss of accuracy. The same prompting approach also works with other providers' models, such as Anthropic's claude-3-5-sonnet-20241022.

python

import os
import json
import asyncio
from openai import AsyncOpenAI

async def extract_invoice_async(invoice_text: str):
    # The async client exposes the same methods as OpenAI, but they are awaitable.
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = f"""Extract invoice_number, date, bill_to, total_amount from the invoice text below as JSON.\n\nInvoice text:\n{invoice_text}\n\nJSON:"""
    # openai>=1.0 has no acreate method; await create on the async client instead.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    extracted_json = response.choices[0].message.content
    try:
        return json.loads(extracted_json)
    except json.JSONDecodeError:
        return {}

async def main():
    invoice_text = '''\nInvoice Number: INV-67890\nDate: 2026-04-01\nBill To: Beta LLC\nTotal Amount: $987.65\n'''
    data = await extract_invoice_async(invoice_text)
    print("Async extracted data:", data)

if __name__ == "__main__":
    asyncio.run(main())

output

Async extracted data: {'invoice_number': 'INV-67890', 'date': '2026-04-01', 'bill_to': 'Beta LLC', 'total_amount': '$987.65'}
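The async variant pays off when several invoices are processed at once. The sketch below runs a per-invoice coroutine (such as extract_invoice_async) over a list while capping the number of concurrent requests. The names extract_many and fake_extract and the default limit of 5 are assumptions for illustration; fake_extract stands in for the real API call so the sketch runs offline.

```python
import asyncio


async def extract_many(invoice_texts, extract_one, max_concurrency=5):
    """Run `extract_one` on every invoice, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(text):
        # Wait for a free slot before starting the request
        async with semaphore:
            return await extract_one(text)

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(text) for text in invoice_texts))


async def fake_extract(text):
    """Offline stand-in for a real extraction call (hypothetical)."""
    await asyncio.sleep(0)
    return {"length": len(text)}
```

To use it with the real API, pass extract_invoice_async as extract_one, for example asyncio.run(extract_many(texts, extract_invoice_async)). Capping concurrency is one way to stay under provider rate limits; the right limit depends on your account.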

Troubleshooting

If the JSON response is malformed, make sure the prompt clearly requests JSON output, add an explicit instruction such as "Respond ONLY with JSON.", or enable JSON mode with response_format={"type": "json_object"}.
If you get API authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
For large invoices, chunk the text and extract data piecewise to avoid token limits.
If extraction accuracy is low, refine your prompt with examples or use a stronger model like gpt-4o.
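When a reply still comes back malformed, it is often valid JSON wrapped in a Markdown code fence or surrounded by a sentence of commentary. The sketch below is one lenient way to recover it before giving up. The helper name parse_json_reply is hypothetical, and the fallback of taking the outermost braces is a heuristic, not a guarantee.

```python
import json
import re

# Three backticks, built programmatically so this listing contains no literal fence
FENCE = "`" * 3
FENCED_REPLY = re.compile(r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", re.DOTALL)


def parse_json_reply(reply: str) -> dict:
    """Best-effort parse of a model reply that should contain one JSON object.

    Illustrative helper: returns {} when no JSON object can be recovered.
    """
    text = reply.strip()

    # Case 1: the whole reply is wrapped in a Markdown code fence
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)

    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Case 2: JSON surrounded by prose; try the outermost {...} span
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return {}
        try:
            data = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return {}

    # Only a JSON object is useful here; reject lists, numbers, and strings
    return data if isinstance(data, dict) else {}
```

This can replace the bare json.loads call in the examples; an empty dict still signals that extraction failed and the request may need a retry.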

Key Takeaways

Use clear, structured prompts requesting JSON to extract invoice fields with LLMs.
Parse the LLM's JSON response to access invoice data programmatically.
Async calls improve throughput for batch processing; streaming lowers the time to first output for long responses.
Model choice balances cost and accuracy; gpt-4o is typically more accurate than gpt-4o-mini but costs more.
Prompt clarity and examples improve extraction reliability.

Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022
