How to extract form fields with an LLM
Quick answer
Use a large language model (LLM) like gpt-4o to extract form fields by prompting it with the document text and a clear instruction to output structured data. Send the document content and extraction instructions as messages to the chat.completions.create endpoint and parse the structured response.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable.
- Run pip install openai to install the SDK.
- Set your API key in your shell: export OPENAI_API_KEY='your_api_key_here' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key_here" (Windows).

Step by step
This example shows how to extract form fields from a text document using gpt-4o. The prompt instructs the model to parse the form and return JSON with field names and values.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

document_text = '''
Name: John Doe
Email: john.doe@example.com
Phone: (555) 123-4567
Address: 123 Main St, Springfield
'''

prompt = f"""
Extract the form fields from the following text and return a JSON object with keys as field names and values as field values.
Text:
{document_text}
Return only the JSON object.
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

extracted_json = response.choices[0].message.content
print("Extracted form fields:")
print(extracted_json)
```

Output:
```
Extracted form fields:
{
  "Name": "John Doe",
  "Email": "john.doe@example.com",
  "Phone": "(555) 123-4567",
  "Address": "123 Main St, Springfield"
}
```

Common variations
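Before reaching for prompt tricks, note that gpt-4o also supports OpenAI's JSON mode, which constrains the reply to syntactically valid JSON (the prompt must still mention the word "JSON"). A minimal sketch — build_extraction_request is my own helper name, and the API call only runs when OPENAI_API_KEY is set:

```python
import os

def build_extraction_request(document_text):
    # Keyword arguments for chat.completions.create with JSON mode enabled.
    # JSON mode requires the word "JSON" to appear somewhere in the prompt.
    return {
        "model": "gpt-4o",
        "response_format": {"type": "json_object"},
        "messages": [{
            "role": "user",
            "content": "Extract the form fields from the following text and "
                       "return only a JSON object.\n" + document_text,
        }],
    }

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        **build_extraction_request("Name: John Doe\nEmail: john.doe@example.com")
    )
    print(response.choices[0].message.content)
```

JSON mode removes the "Return only the JSON object" failure case entirely, at the cost of being specific to OpenAI's API.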
You can adapt this approach by:
- Using other LLMs, such as claude-3-5-haiku-20241022 with the Anthropic SDK.
- Sending scanned documents through OCR first, then passing the extracted text to the LLM.
- Prompting for specific field extraction or validation rules.
- Using async calls or streaming if supported by your SDK.
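For the OCR variation, here is a minimal sketch. It assumes the Tesseract binary plus the pytesseract and Pillow packages are installed; normalize_ocr_text and ocr_page are my own helper names:

```python
import re

def normalize_ocr_text(raw):
    """Clean up common OCR noise: collapse runs of spaces, drop blank lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)

def ocr_page(image_path):
    # Requires the Tesseract binary and the pytesseract + Pillow packages.
    import pytesseract
    from PIL import Image
    return normalize_ocr_text(pytesseract.image_to_string(Image.open(image_path)))

# The normalized text can then be dropped into the extraction prompt above.
print(normalize_ocr_text("Name:   John  Doe \n\n Email: j@x.com "))
```

Normalizing whitespace before prompting tends to help, since OCR output is often ragged and the extra tokens add cost without adding signal.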
For example, the same extraction with the Anthropic SDK:

```python
import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

document_text = '''
Name: Jane Smith
Email: jane.smith@example.com
Phone: 555-987-6543
'''

system_prompt = "You are a helpful assistant that extracts form fields as JSON."
user_message = f"Extract form fields from this text:\n{document_text}\nReturn only JSON."

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=512,
    system=system_prompt,
    messages=[{"role": "user", "content": user_message}]
)

print("Extracted form fields:")
print(response.content[0].text)
```

Output:
```
Extracted form fields:
{
  "Name": "Jane Smith",
  "Email": "jane.smith@example.com",
  "Phone": "555-987-6543"
}
```

Troubleshooting
- If the model returns extra text besides JSON, refine your prompt to emphasize "Return only the JSON object."
- If fields are missing, increase max_tokens or split large documents into smaller chunks.
- For scanned PDFs, use OCR tools like Tesseract before extraction.
- Check your API key environment variable if authentication errors occur.
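The extra-text problem can also be handled in code: a small defensive parser (a sketch; parse_model_json is my own name) recovers the JSON object even when the model wraps it in a markdown fence or surrounding prose:

```python
import json
import re

def parse_model_json(reply):
    """Extract the first JSON object from an LLM reply.

    Handles replies wrapped in markdown fences or surrounded by prose.
    """
    # Strip a ```json ... ``` fence if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", reply, re.DOTALL)
    if fenced:
        reply = fenced.group(1)
    # Fall back to the outermost {...} span.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start:end + 1])

messy = 'Sure! Here are the fields:\n```json\n{"Name": "John Doe"}\n```'
print(parse_model_json(messy))  # {'Name': 'John Doe'}
```

Parsing with json.loads rather than using the raw string also catches malformed replies early, so you can retry the request instead of propagating bad data.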
Key Takeaways
- Use clear prompts instructing the LLM to output structured JSON for form fields.
- Preprocess scanned documents with OCR before passing text to the LLM.
- Adjust max_tokens and chunk size for large documents to avoid truncation.
- Anthropic and OpenAI both support form field extraction with similar prompt engineering.
- Always keep your API keys secure and set via environment variables.
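The chunking takeaway can be sketched with a simple splitter (chunk_text is my own helper name): each chunk is sent through the extraction prompt separately, and the per-chunk JSON objects are merged afterwards.

```python
def chunk_text(text, max_chars=4000, overlap=200):
    """Split a long document into overlapping chunks that each fit in a prompt."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap  # overlap avoids splitting a field across chunks
    return chunks

# Each chunk would go through the extraction prompt shown earlier,
# and the resulting dictionaries merged with dict.update().
print(chunk_text("Name: John Doe\n" * 500, max_chars=2000, overlap=100)[0][:14])
```

The overlap is a hedge: without it, a field whose label and value straddle a chunk boundary would be lost from both chunks.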