
How to extract financial data with an LLM

Quick answer
Prompt a large language model such as gpt-4o with structured instructions and an example of the output format you want. Send the financial text through the chat.completions.create API, then parse the JSON (or tabular text) the model returns into structured data.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell does not treat > as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
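# Quote the requirement so the shell does not interpret ">" as output redirection
pip install "openai>=1.0"
# Replace the placeholder with your own key
export OPENAI_API_KEY="your-api-key"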
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

This example shows how to extract key financial data such as revenue, net income, and EPS from a financial report snippet using gpt-4o. The prompt instructs the model to return JSON for easy parsing.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

financial_text = """
Company XYZ reported a revenue of $5 billion in 2025, with a net income of $1.2 billion and earnings per share (EPS) of $3.45.
"""

# Ask for JSON explicitly and name the keys so the reply is easy to parse
prompt = f"Extract the financial data as JSON with keys: revenue, net_income, eps.\nText:\n{financial_text}\nJSON:"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

extracted_json = response.choices[0].message.content
print("Extracted financial data:", extracted_json)
output
Extracted financial data: {
  "revenue": "$5 billion",
  "net_income": "$1.2 billion",
  "eps": "$3.45"
}
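
The reply in message.content is a plain string, so it still has to be parsed before the values can be used. The sketch below shows one possible way to do that, assuming the model may wrap its JSON in a Markdown code fence and that amounts look like the ones in this article ("$5 billion", "$800M", "$3.45"). The helper names parse_model_json and to_number and the scale table are hypothetical, not part of the openai package.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker
# Assumed scale suffixes; extend this table for the formats in your own reports.
_SCALE = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6,
          "b": 1e9, "billion": 1e9}

def parse_model_json(raw: str) -> dict:
    """Parse the model's reply, tolerating an optional Markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a "json" tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)

def to_number(value: str) -> float:
    """Convert a string such as "$5 billion" or "$800M" to a plain float."""
    match = re.fullmatch(r"\s*\$?\s*([\d,]*\.?\d+)\s*([a-zA-Z]*)\s*", value)
    if match is None:
        raise ValueError(f"Unrecognized amount: {value!r}")
    number = float(match.group(1).replace(",", ""))
    suffix = match.group(2).lower()
    if suffix and suffix not in _SCALE:
        raise ValueError(f"Unknown scale suffix in {value!r}")
    return number * _SCALE.get(suffix, 1)
```

With the output shown above, to_number(parse_model_json(extracted_json)["revenue"]) would give 5000000000.0. json.loads raises json.JSONDecodeError when the reply is not valid JSON, which is a useful signal to retry or tighten the prompt.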

Common variations

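  • Use gpt-4o-mini for faster, cheaper extraction at some cost in accuracy.
  • Use the AsyncOpenAI client with asyncio and await to run many extractions concurrently in a pipeline.
  • Use streaming mode (stream=True) to receive the model's output incrementally as it is generated. Streaming does not change how much input fits in one request, so split very large documents into sections and extract from each.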
  • Customize prompts to extract additional fields like EBITDA, cash flow, or ratios.
python
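import asyncio
import os

from openai import AsyncOpenAI

async def extract_financial_data_async(text: str):
    # AsyncOpenAI is needed here: the synchronous OpenAI client's calls cannot be awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])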
    prompt = f"Extract revenue, net_income, eps as JSON.\nText:\n{text}\nJSON:"
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    text = "Company ABC had revenue $3B, net income $800M, EPS $2.10 in 2025."
    result = await extract_financial_data_async(text)
    print("Async extracted data:", result)

asyncio.run(main())
output
Async extracted data: {
  "revenue": "$3B",
  "net_income": "$800M",
  "eps": "$2.10"
}
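
To extract from many snippets at once, the async function can be fanned out with asyncio.gather. The sketch below is one way to cap the number of requests in flight with a semaphore. The helper name extract_many and the default limit of 5 are assumptions for illustration, and the extractor is passed in as a parameter so any coroutine that takes a string and returns a string will work.

```python
import asyncio
from typing import Awaitable, Callable, List

async def extract_many(
    texts: List[str],
    extract: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,  # assumed limit; tune to your rate limits
) -> List[str]:
    """Run `extract` over all texts with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> str:
        # Each task waits here until one of the semaphore slots is free.
        async with semaphore:
            return await extract(text)

    # gather returns results in the same order as the input texts.
    return await asyncio.gather(*(bounded(t) for t in texts))
```

For example, asyncio.run(extract_many(snippets, extract_financial_data_async)) would return one reply per snippet, in input order. Creating one client and reusing it across calls, rather than building a new one inside every call, is usually the better choice in a pipeline like this.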

Troubleshooting

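  • If the model returns unstructured text instead of JSON, make the prompt explicit, for example "Return only JSON, no extra text.", or pass response_format={"type": "json_object"} to chat.completions.create to turn on JSON mode (the prompt must still mention JSON).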
  • If extraction misses fields, provide example JSON outputs in the prompt to guide the model.
  • For inconsistent currency formats, normalize input text or add instructions to standardize units.
  • If you hit rate limits, implement exponential backoff or switch to a smaller model.
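
The exponential backoff mentioned in the last item can be sketched as a small retry wrapper. This is a minimal illustration, not a definitive implementation: the name with_backoff, the defaults, and the 10% jitter are assumptions. The exception types to retry on are passed in by the caller; with the openai package that would typically be openai.RateLimitError.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying on `retry_on` with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delay doubles on each attempt (1s, 2s, 4s, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

A call could then be wrapped as with_backoff(lambda: client.chat.completions.create(model="gpt-4o", messages=messages), (openai.RateLimitError,)), where client and messages are set up as in the step-by-step example. The client constructor also accepts a max_retries argument, which may be enough for simple cases.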

Key Takeaways

  • Use explicit JSON output prompts to reliably extract structured financial data from LLMs.
  • The gpt-4o model balances accuracy and cost for financial extraction tasks.
  • Async calls let you run many extractions concurrently; streaming delivers output incrementally but does not increase how much input fits in one request.
  • Prompt engineering with examples improves extraction quality and consistency.
  • Handle API rate limits and format inconsistencies proactively for robust pipelines.
Verified 2026-04 · gpt-4o, gpt-4o-mini