How to extract invoice data with an LLM
Quick answer
Use a large language model such as gpt-4o-mini with the OpenAI Python SDK to parse invoice text and extract structured data fields. Send the invoice content, together with extraction instructions, to the chat.completions.create method and parse the JSON response for the invoice details.
PREREQUISITES
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable.
- Run pip install openai to install the SDK.
- Set your API key in your shell: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows).
pip install openai
Step by step
This example shows how to extract key invoice fields like invoice number, date, vendor, and total amount from raw invoice text using gpt-4o-mini. The prompt instructs the model to return JSON with the extracted data.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
invoice_text = '''
Invoice Number: INV-12345
Date: 2026-03-15
Vendor: Acme Corporation
Total Amount: $1,234.56
Thank you for your business.
'''
prompt = f"""Extract the following fields from the invoice text below and return a JSON object with keys: invoice_number, date, vendor, total_amount.
Invoice text:\n{invoice_text}
JSON:"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}]
)
extracted_json = response.choices[0].message.content
print("Extracted invoice data:", extracted_json)
Output:
Extracted invoice data: {
"invoice_number": "INV-12345",
"date": "2026-03-15",
"vendor": "Acme Corporation",
"total_amount": "$1,234.56"
}
Common variations
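One common variation is to turn the model's reply into typed Python values instead of printing the raw string. The following is a minimal sketch rather than part of the original example: parse_invoice and REQUIRED_KEYS are hypothetical names, and the sample reply is hard-coded from the output shown earlier so the snippet runs without an API call. It assumes the model returned only a JSON object, an ISO date (YYYY-MM-DD), and a US-style amount.

```python
import json
from datetime import date
from decimal import Decimal

# Sample reply copied from the output shown earlier; in practice this would be
# response.choices[0].message.content.
extracted_json = """{
  "invoice_number": "INV-12345",
  "date": "2026-03-15",
  "vendor": "Acme Corporation",
  "total_amount": "$1,234.56"
}"""

# Hypothetical constant: the keys requested in the prompt.
REQUIRED_KEYS = ("invoice_number", "date", "vendor", "total_amount")


def parse_invoice(reply: str) -> dict:
    """Parse the model's JSON reply and convert fields to typed values.

    Hypothetical helper: assumes an ISO date and an amount like "$1,234.56".
    """
    data = json.loads(reply)  # raises json.JSONDecodeError on invalid JSON
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Strip the currency symbol and thousands separators before converting.
    amount_text = str(data["total_amount"]).replace("$", "").replace(",", "")
    return {
        "invoice_number": str(data["invoice_number"]),
        "date": date.fromisoformat(data["date"]),
        "vendor": str(data["vendor"]),
        "total_amount": Decimal(amount_text),
    }


invoice = parse_invoice(extracted_json)
print(invoice["invoice_number"], invoice["date"], invoice["total_amount"])
```

Using Decimal rather than float keeps currency amounts exact, and failing early on missing keys makes an incomplete reply visible before it reaches downstream code.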
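You can use other LLM providers such as Anthropic Claude or Google Gemini with a similar prompt. For asynchronous calls, use the SDK's async client where one is provided (for example, AsyncOpenAI in the openai package or AsyncAnthropic in the anthropic package). To handle scanned PDFs, run an OCR tool (e.g., Tesseract) first and pass the recognized text to the LLM. You can also customize the prompt to extract additional fields or to return a different output format. The example below uses the Anthropic SDK.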
from anthropic import Anthropic
import os
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
invoice_text = "Invoice Number: INV-12345\nDate: 2026-03-15\nVendor: Acme Corporation\nTotal Amount: $1,234.56"
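# Name the expected keys so the reply matches the structure shown in the output.
system_prompt = "You extract invoice data. Reply with only a JSON object with keys: invoice_number, date, vendor, total_amount."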
response = client.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=512,
system=system_prompt,
messages=[{"role": "user", "content": f"Extract invoice data from:\n{invoice_text}"}]
)
print("Extracted invoice data:", response.content[0].text)
Output:
Extracted invoice data: {
"invoice_number": "INV-12345",
"date": "2026-03-15",
"vendor": "Acme Corporation",
"total_amount": "$1,234.56"
}
Troubleshooting
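- If the model returns unstructured text instead of JSON, make the prompt explicitly ask for a JSON object and nothing else. With the OpenAI SDK you can also pass response_format={"type": "json_object"} to chat.completions.create to enable JSON mode.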
- If you get incomplete data, increase max_tokens or simplify the invoice text.
- For noisy OCR text, pre-clean the text before sending it to the LLM.
- Ensure your API key is correctly set in os.environ["OPENAI_API_KEY"] to avoid authentication errors.
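When a reply still arrives wrapped in prose or Markdown code fences, one option is to recover the JSON object in code instead of failing. The sketch below uses only the standard library; extract_json_object is a hypothetical helper name, and the noisy reply is made-up sample text rather than real model output.

```python
import json


def extract_json_object(reply: str) -> dict:
    """Return the first JSON object found in a model reply.

    Hypothetical helper: scans for each "{" and tries to decode a JSON
    object starting there, so surrounding prose or code fences are ignored.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and ignores any text that follows it.
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in reply")


# Made-up noisy reply; the fence is built programmatically to avoid nesting
# code fences inside this example.
fence = "`" * 3
noisy_reply = (
    "Sure! Here is the extracted data:\n"
    f"{fence}json\n"
    '{"invoice_number": "INV-12345", "total_amount": "$1,234.56"}\n'
    f"{fence}\n"
)
print(extract_json_object(noisy_reply))
```

This is a fallback, not a substitute for a clear prompt: if no decodable object is present the helper raises ValueError, which is a reasonable point to retry the request or flag the invoice for manual review.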
Key takeaways
- Use explicit prompt instructions to get structured JSON output from LLMs for invoice data extraction.
- Combine OCR preprocessing with LLMs for scanned or image-based invoices.
- Read API keys from environment variables instead of hard-coding them, and use the current SDK client interfaces (OpenAI(), Anthropic()) for API calls.