
How to extract invoice data with an LLM

Quick answer
Use a large language model such as gpt-4o-mini through the OpenAI Python SDK to turn invoice text into structured fields. Pass the invoice text, together with extraction instructions, to the chat.completions.create method, then parse the JSON in the response to get the invoice details.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free-tier availability depends on your account; check your usage limits)
  • pip install "openai>=1.0" (quote the requirement so the shell does not treat > as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
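  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows). On Windows, setx only affects terminals opened afterwards, so start a new one before running the script.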
bash
pip install openai
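
Before making any API calls, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch; the helper name require_api_key is hypothetical and not part of the OpenAI SDK.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the API key stored under `name`, or raise a readable error.

    `require_api_key` is an illustrative helper, not an SDK function.
    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell, "
            "then run the script from a new terminal."
        )
    return key
```

Calling require_api_key() at the top of a script fails early with a clear message instead of a KeyError or an authentication error later on.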

Step by step

This example shows how to extract key invoice fields like invoice number, date, vendor, and total amount from raw invoice text using gpt-4o-mini. The prompt instructs the model to return JSON with the extracted data.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

invoice_text = '''
Invoice Number: INV-12345
Date: 2026-03-15
Vendor: Acme Corporation
Total Amount: $1,234.56

Thank you for your business.
'''

prompt = f"""Extract the following fields from the invoice text below and return a JSON object with keys: invoice_number, date, vendor, total_amount.

Invoice text:\n{invoice_text}

JSON:"""

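response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    # JSON mode: the reply is a syntactically valid JSON object.
    # It requires the word "JSON" to appear in the messages, which the prompt does.
    response_format={"type": "json_object"},
)

# The reply is a JSON string; parse it (e.g. with json.loads) to get a dict.
extracted_json = response.choices[0].message.content
print("Extracted invoice data:", extracted_json)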
output
Extracted invoice data: {
  "invoice_number": "INV-12345",
  "date": "2026-03-15",
  "vendor": "Acme Corporation",
  "total_amount": "$1,234.56"
}
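
The reply printed above is still a plain string. One possible way to turn it into typed Python values is sketched below. The function name parse_invoice_reply is hypothetical, and the sketch assumes the four keys requested in the prompt, a US-style amount such as "$1,234.56", and an ISO 8601 date.

```python
import json
import re
from datetime import date
from decimal import Decimal


def parse_invoice_reply(reply):
    """Turn the model's text reply into a dict with typed values."""
    # Keep only the outermost {...} span, which drops Markdown fences or
    # any prose the model may have added around the JSON object.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))

    # Normalize "$1,234.56" to Decimal("1234.56").
    # Assumption: US-style formatting (comma thousands, dot decimals).
    amount = re.sub(r"[^\d.\-]", "", str(data["total_amount"]))
    data["total_amount"] = Decimal(amount)
    # Assumption: the model returned the date as YYYY-MM-DD.
    data["date"] = date.fromisoformat(data["date"])
    return data


# A sample reply with extra prose around the JSON, as models sometimes produce.
reply = (
    "Here is the extracted data:\n"
    '{"invoice_number": "INV-12345", "date": "2026-03-15", '
    '"vendor": "Acme Corporation", "total_amount": "$1,234.56"}'
)
invoice = parse_invoice_reply(reply)
print(invoice["total_amount"], invoice["date"].year)
```

Decimal is used instead of float so that currency amounts are not affected by binary rounding.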

Common variations

The same prompt works with other LLM providers such as Anthropic Claude or Google Gemini. For asynchronous calls, use the SDK's async client where one exists (for example AsyncOpenAI or AsyncAnthropic). To handle scanned PDFs, run an OCR tool (e.g., Tesseract) first and pass the recognized text to the LLM. You can also customize the prompt to extract additional fields or to return a different output format. The example below uses Anthropic Claude.

python
from anthropic import Anthropic
import os

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

invoice_text = "Invoice Number: INV-12345\nDate: 2026-03-15\nVendor: Acme Corporation\nTotal Amount: $1,234.56"

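# Name the keys explicitly so the reply has a predictable shape.
system_prompt = (
    "You extract invoice data. Reply with only a JSON object with keys: "
    "invoice_number, date, vendor, total_amount."
)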

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=512,
    system=system_prompt,
    messages=[{"role": "user", "content": f"Extract invoice data from:\n{invoice_text}"}]
)

print("Extracted invoice data:", response.content[0].text)
output
Extracted invoice data: {
  "invoice_number": "INV-12345",
  "date": "2026-03-15",
  "vendor": "Acme Corporation",
  "total_amount": "$1,234.56"
}
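
To extract additional fields, the prompt can be generated from a list of field names rather than written by hand. This is a minimal sketch: build_extraction_prompt is a hypothetical helper, and the extra field names due_date and currency are illustrative only.

```python
def build_extraction_prompt(invoice_text, fields):
    """Build an extraction prompt for an arbitrary list of field names."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the invoice text below and "
        f"return a JSON object with keys: {field_list}. "
        "Use null for any field that is missing.\n\n"
        f"Invoice text:\n{invoice_text.strip()}\n\n"
        "JSON:"
    )


# Hypothetical field names, chosen only to illustrate the idea.
prompt = build_extraction_prompt(
    "Invoice Number: INV-12345\nDue Date: 2026-04-14",
    ["invoice_number", "due_date", "currency"],
)
print(prompt)
```

The resulting string can be sent as the user message to either provider shown in this guide.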

Troubleshooting

  • If the model returns unstructured text instead of JSON, clarify the prompt to explicitly request JSON output.
  • If you get incomplete data, increase max_tokens or simplify the invoice text.
  • For noisy OCR text, pre-clean the text before sending it to the LLM.
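  • Make sure OPENAI_API_KEY (or ANTHROPIC_API_KEY for the Claude example) is set in the environment the script runs in. A missing variable raises KeyError at os.environ[...], and an invalid key causes an authentication error.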
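
For the noisy-OCR case above, a conservative cleanup pass might look like the sketch below. The function name clean_ocr_text and the specific rules are assumptions chosen for illustration; real OCR output may need different handling.

```python
import re
import unicodedata


def clean_ocr_text(text):
    """Apply conservative cleanup to OCR output before prompting."""
    # NFKC folds ligatures and full-width characters to plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break ("Corpo-\nration").
    # Only letter-to-lowercase joins, so IDs like "INV-\n12345" keep the hyphen.
    text = re.sub(r"([A-Za-z])-\n([a-z])", r"\1\2", text)
    # Drop control and format characters except newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces/tabs, trim each line, and drop empty lines.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


noisy = (
    "Invoice  Number:\tINV-12345\x0c\n\n\n"
    "Vendor: Acme Corpo-\nration \n"
    "Total Amount:   $1,234.56"
)
print(clean_ocr_text(noisy))
```

Keeping the rules narrow reduces the risk of altering values such as invoice numbers or amounts.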

Key Takeaways

  • Use explicit prompt instructions to get structured JSON output from LLMs for invoice data extraction.
  • Combine OCR preprocessing with LLMs for scanned or image-based invoices.
  • Read API keys from environment variables instead of hard-coding them, and use the current client-based SDK interfaces (OpenAI(), Anthropic()).
Verified 2026-04 · gpt-4o-mini, claude-3-5-haiku-20241022