How-to · Beginner · 4 min read

How to extract form fields with an LLM

Quick answer
Use a large language model (LLM) such as gpt-4o to extract form fields: prompt it with the document text and a clear instruction to output structured data. Send the document content and extraction instructions as messages to the chat.completions.create endpoint, then parse the structured response.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the specifier so the shell does not treat > as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key_here' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key_here" (Windows).
bash
pip install openai
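Before running the examples, it can help to fail fast when the key is missing. The require_env helper below is a small convenience sketched for this article, not part of the SDK:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, failing with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; see the setup steps above.")
    return value
```

Call require_env("OPENAI_API_KEY") once at startup so a missing key surfaces immediately rather than as an authentication error mid-request.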

Step by step

This example shows how to extract form fields from a text document using gpt-4o. The prompt instructs the model to parse the form and return JSON with field names and values.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

document_text = '''
Name: John Doe
Email: john.doe@example.com
Phone: (555) 123-4567
Address: 123 Main St, Springfield
'''

prompt = f"""
Extract the form fields from the following text and return a JSON object with keys as field names and values as field values.

Text:
{document_text}

Return only the JSON object.
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

extracted_json = response.choices[0].message.content
print("Extracted form fields:")
print(extracted_json)
output
Extracted form fields:
{
  "Name": "John Doe",
  "Email": "john.doe@example.com",
  "Phone": "(555) 123-4567",
  "Address": "123 Main St, Springfield"
}
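The reply is a plain string, so parse it with json.loads before using the fields. Some models wrap JSON in markdown fences even when told not to, so it is worth stripping those first. parse_json_reply is a small helper sketched here, not part of the OpenAI SDK:

```python
import json

def parse_json_reply(reply: str) -> dict:
    """Parse the model's JSON reply, tolerating optional ```json fences."""
    text = reply.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line (e.g. ```json)
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(text)

fields = parse_json_reply('```json\n{"Name": "John Doe"}\n```')
print(fields["Name"])  # John Doe
```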

Common variations

You can adapt this approach by:

  • Using other LLMs like claude-3-5-haiku-20241022 with the Anthropic SDK.
  • Sending scanned documents through OCR first, then passing extracted text to the LLM.
  • Prompting for specific field extraction or validation rules.
  • Using async calls or streaming if supported by your SDK.
For example, here is the same extraction using the Anthropic SDK:

python
import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

document_text = '''
Name: Jane Smith
Email: jane.smith@example.com
Phone: 555-987-6543
'''

system_prompt = "You are a helpful assistant that extracts form fields as JSON."
user_message = f"Extract form fields from this text:\n{document_text}\nReturn only JSON."

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=512,
    system=system_prompt,
    messages=[{"role": "user", "content": user_message}]
)

print("Extracted form fields:")
print(response.content[0].text)
output
Extracted form fields:
{
  "Name": "Jane Smith",
  "Email": "jane.smith@example.com",
  "Phone": "555-987-6543"
}
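To enforce validation rules after extraction, check the parsed fields against an expected schema before using them downstream. The schema below (REQUIRED_FIELDS and the email pattern) is an assumption for this example form, not a general standard:

```python
import json
import re

REQUIRED_FIELDS = {"Name", "Email", "Phone"}  # assumed schema for this example form

def validate_fields(raw_json: str) -> dict:
    """Parse the model's JSON reply and check it against the expected schema."""
    fields = json.loads(raw_json)
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Loose sanity check, not a full RFC-compliant email validator.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", fields["Email"]):
        raise ValueError("Email does not look like an address")
    return fields
```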

Troubleshooting

  • If the model returns extra text besides JSON, refine your prompt to emphasize "Return only the JSON object," or enable OpenAI's JSON mode by passing response_format={"type": "json_object"} to chat.completions.create.
  • If fields are missing or the output is cut off, set a larger max_tokens or split large documents into smaller chunks.
  • For scanned PDFs, use OCR tools like Tesseract before extraction.
  • Check your API key environment variable if authentication errors occur.
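For the chunking tip above, splitting on line boundaries is usually enough for form-like text. chunk_text below is a sketch with an assumed character budget; send each chunk to the model separately and merge the resulting JSON objects:

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on line boundaries.

    A single line longer than max_chars is kept whole rather than split mid-line.
    """
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```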

Key takeaways

  • Use clear prompts instructing the LLM to output structured JSON for form fields.
  • Preprocess scanned documents with OCR before passing text to the LLM.
  • Adjust max_tokens and chunk size for large documents to avoid truncation.
  • Anthropic and OpenAI both support form field extraction with similar prompt engineering.
  • Always keep your API keys secure and set via environment variables.
Verified 2026-04 · gpt-4o, claude-3-5-haiku-20241022