
How to extract invoice data with an LLM

Quick answer
Send the invoice text and a clear extraction schema to an LLM such as gpt-4o through a chat.completions.create call, ask for JSON output, then parse the JSON response for fields such as invoice number, date, and total amount.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key with access to gpt-4o (free-tier availability varies by account)
  • pip install "openai>=1.0" (quote the requirement so the shell does not treat > as a redirect)

Setup

Install the official openai Python SDK and set your API key as an environment variable.

  • Install SDK: pip install openai
  • Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows; setx only takes effect in terminals opened afterwards)
bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
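
Before running the examples, it can help to confirm that Python actually sees the key. The check below is a minimal sketch: require_api_key is a hypothetical helper name used only for illustration and is not part of the openai SDK.

```python
import os


def require_api_key(env=os.environ, name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail early with a clear message."""
    # Hypothetical helper: the SDK itself only raises once a client is constructed.
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the examples")
    return key
```

Calling require_api_key() at the top of a script turns a missing variable into an explicit error rather than a KeyError or an authentication failure later on.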

Step by step

This example shows how to extract key invoice data fields by sending the invoice text to gpt-4o with a prompt requesting JSON output. The response is parsed to access structured data.

python
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

invoice_text = '''
Invoice Number: INV-12345
Date: 2026-03-15
Bill To: Acme Corp
Total Amount: $1,234.56
'''

prompt = f"""Extract the following fields from the invoice text below as JSON with keys: invoice_number, date, bill_to, total_amount.

Invoice text:\n{invoice_text}\n
JSON:"""

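response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    # JSON mode: ask the API for a valid JSON object rather than free text
    response_format={"type": "json_object"},
)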

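# The reply text is expected to contain the JSON object
extracted_json = response.choices[0].message.content

try:
    invoice_data = json.loads(extracted_json)
except json.JSONDecodeError:
    # Fall back to an empty dict if the reply is not valid JSON
    invoice_data = {}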

print("Extracted invoice data:", invoice_data)
output
Extracted invoice data: {'invoice_number': 'INV-12345', 'date': '2026-03-15', 'bill_to': 'Acme Corp', 'total_amount': '$1,234.56'}
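
The extracted values arrive as strings, so downstream code usually needs typed values. The sketch below shows one possible way to validate and convert them, assuming ISO dates (YYYY-MM-DD) and dollar amounts formatted like the sample invoice. normalize_invoice and REQUIRED_KEYS are illustrative names, not part of any SDK.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Keys requested in the extraction prompt
REQUIRED_KEYS = ("invoice_number", "date", "bill_to", "total_amount")


def normalize_invoice(raw: dict) -> dict:
    """Check that all requested keys are present and convert date and amount."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    # Assumption: amounts look like "$1,234.56"; drop the symbol and separators.
    amount_text = str(raw["total_amount"]).replace("$", "").replace(",", "").strip()
    try:
        total = Decimal(amount_text)
    except InvalidOperation as exc:
        raise ValueError(f"unparseable total_amount: {raw['total_amount']!r}") from exc

    return {
        "invoice_number": str(raw["invoice_number"]).strip(),
        # Assumption: the model returns the date as YYYY-MM-DD.
        "date": date.fromisoformat(str(raw["date"]).strip()),
        "bill_to": str(raw["bill_to"]).strip(),
        "total_amount": total,
    }


sample = {
    "invoice_number": "INV-12345",
    "date": "2026-03-15",
    "bill_to": "Acme Corp",
    "total_amount": "$1,234.56",
}
normalized = normalize_invoice(sample)
print(normalized)
```

Decimal is used instead of float so that currency amounts are not subject to binary rounding.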

Common variations

You can use asynchronous calls with asyncio to raise throughput when processing invoices in batches. Streaming responses let you handle output token by token as it arrives, which helps when a large invoice produces a long response. A smaller model such as gpt-4o-mini reduces cost, usually at some loss of accuracy. You can also use Anthropic's claude-3-5-sonnet-20241022 with similar prompt engineering.

python
import os
import json
import asyncio
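from openai import AsyncOpenAI

async def extract_invoice_async(invoice_text: str):
    # In openai>=1.0 the awaitable client is AsyncOpenAI; there is no acreate method
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = f"""Extract invoice_number, date, bill_to, total_amount from the invoice text below as JSON.\n\nInvoice text:\n{invoice_text}\n\nJSON:"""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        # JSON mode: ask the API for a valid JSON object rather than free text
        response_format={"type": "json_object"},
    )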
    extracted_json = response.choices[0].message.content
    try:
        return json.loads(extracted_json)
    except json.JSONDecodeError:
        return {}

async def main():
    invoice_text = '''\nInvoice Number: INV-67890\nDate: 2026-04-01\nBill To: Beta LLC\nTotal Amount: $987.65\n'''
    data = await extract_invoice_async(invoice_text)
    print("Async extracted data:", data)

if __name__ == "__main__":
    asyncio.run(main())
output
Async extracted data: {'invoice_number': 'INV-67890', 'date': '2026-04-01', 'bill_to': 'Beta LLC', 'total_amount': '$987.65'}
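
The async example above handles a single invoice, so it does not yet show the batch benefit. One possible way to run many extractions concurrently is asyncio.gather with a semaphore that caps the number of in-flight requests. In this sketch, extract_batch, fake_extract, and max_concurrency are illustrative names; fake_extract stands in for extract_invoice_async so the code runs without an API key.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def extract_batch(
    invoice_texts: List[str],
    extract_one: Callable[[str], Awaitable[Dict]],
    max_concurrency: int = 5,
) -> List[Dict]:
    """Run extract_one over every invoice, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(text: str) -> Dict:
        async with semaphore:
            try:
                return await extract_one(text)
            except Exception:
                # One failed invoice should not abort the whole batch.
                return {}

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(text) for text in invoice_texts))


async def fake_extract(text: str) -> Dict:
    """Stand-in extractor (assumption): reads the value after the first colon."""
    await asyncio.sleep(0)
    return {"invoice_number": text.split(":")[1].strip()}


results = asyncio.run(
    extract_batch(["Invoice Number: INV-1", "Invoice Number: INV-2"], fake_extract)
)
print(results)
```

To use it against the API, pass extract_invoice_async in place of fake_extract and pick a max_concurrency that stays within your account's rate limits.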

Troubleshooting

  • If the JSON response is malformed, make sure your prompt clearly requests JSON output, add an explicit instruction such as "Respond ONLY with JSON.", or enable JSON mode by passing response_format={"type": "json_object"} to chat.completions.create.
  • If you get API authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
  • For large invoices, chunk the text and extract data piecewise to avoid token limits.
  • If extraction accuracy is low, refine your prompt with examples or use a stronger model like gpt-4o.
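
For the malformed-JSON case, a tolerant parser can recover replies that wrap the object in a Markdown code fence or surround it with prose. The following is a sketch of one approach, not a complete solution; parse_json_reply and FENCE are hypothetical names introduced here.

```python
import json
import re

# Three backticks, built programmatically so the listing is easy to embed
FENCE = "`" * 3


def parse_json_reply(reply: str) -> dict:
    """Parse a model reply as a JSON object, tolerating fences and extra prose."""
    text = reply.strip()
    # Drop a leading fence (optionally tagged "json") and a trailing fence.
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the reply.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return {}
        try:
            parsed = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return {}
    # Only a JSON object is useful here; lists or scalars count as failures.
    return parsed if isinstance(parsed, dict) else {}
```

Swapping this in for the bare json.loads call keeps the same empty-dict fallback while accepting the most common formatting deviations.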

Key Takeaways

  • Use clear, structured prompts requesting JSON to extract invoice fields with LLMs.
  • Parse the LLM's JSON response to access invoice data programmatically.
  • Async calls raise throughput for batch processing; streaming lets you start handling output before a long response finishes.
  • Model choice balances cost and accuracy: gpt-4o-mini is cheaper, while gpt-4o is generally more accurate.
  • Prompt clarity and examples improve extraction reliability.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022