How to extract form fields with an LLM
Quick answer
Use a large language model (LLM) like gpt-4o to extract form fields by prompting it with the document text and a clear instruction to output structured data. Send the document content and extraction instructions as messages to the chat.completions.create endpoint and parse the structured response.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable.
- Run pip install openai to install the SDK.
- Set your API key in your shell: export OPENAI_API_KEY='your_api_key_here' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key_here" (Windows).

Step by step
This example shows how to extract form fields from a text document using gpt-4o. The prompt instructs the model to parse the form and return JSON with field names and values.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

document_text = '''
Name: John Doe
Email: john.doe@example.com
Phone: (555) 123-4567
Address: 123 Main St, Springfield
'''

prompt = f"""
Extract the form fields from the following text and return a JSON object with keys as field names and values as field values.
Text:
{document_text}
Return only the JSON object.
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

extracted_json = response.choices[0].message.content
print("Extracted form fields:")
print(extracted_json)
```

Output:
```
Extracted form fields:
{
  "Name": "John Doe",
  "Email": "john.doe@example.com",
  "Phone": "(555) 123-4567",
  "Address": "123 Main St, Springfield"
}
```

Common variations
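Before reaching for prompt tricks, note that gpt-4o also supports OpenAI's JSON mode, which constrains the reply to syntactically valid JSON (the prompt must still mention the word "JSON"). A minimal sketch — build_extraction_request is my own helper name, and the API call only runs when OPENAI_API_KEY is set:

```python
import os

def build_extraction_request(document_text):
    # Keyword arguments for chat.completions.create with JSON mode enabled.
    # JSON mode requires the word "JSON" to appear somewhere in the prompt.
    return {
        "model": "gpt-4o",
        "response_format": {"type": "json_object"},
        "messages": [{
            "role": "user",
            "content": "Extract the form fields from the following text and "
                       "return only a JSON object.\n" + document_text,
        }],
    }

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        **build_extraction_request("Name: John Doe\nEmail: john.doe@example.com")
    )
    print(response.choices[0].message.content)
```

JSON mode removes the "Return only the JSON object" failure case entirely, at the cost of being specific to OpenAI's API.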
You can adapt this approach by:
- Using other LLMs, such as claude-3-5-haiku-20241022 with the Anthropic SDK.
- Sending scanned documents through OCR first, then passing the extracted text to the LLM.
- Prompting for specific field extraction or validation rules.
- Using async calls or streaming if supported by your SDK.
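For the OCR variation, here is a minimal sketch. It assumes the Tesseract binary plus the pytesseract and Pillow packages are installed; normalize_ocr_text and ocr_page are my own helper names:

```python
import re

def normalize_ocr_text(raw):
    """Clean up common OCR noise: collapse runs of spaces, drop blank lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)

def ocr_page(image_path):
    # Requires the Tesseract binary and the pytesseract + Pillow packages.
    import pytesseract
    from PIL import Image
    return normalize_ocr_text(pytesseract.image_to_string(Image.open(image_path)))

# The normalized text can then be dropped into the extraction prompt above.
print(normalize_ocr_text("Name:   John  Doe \n\n Email: j@x.com "))
```

Normalizing whitespace before prompting tends to help, since OCR output is often ragged and the extra tokens add cost without adding signal.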
For example, the same extraction with the Anthropic SDK:

```python
import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

document_text = '''
Name: Jane Smith
Email: jane.smith@example.com
Phone: 555-987-6543
'''

system_prompt = "You are a helpful assistant that extracts form fields as JSON."
user_message = f"Extract form fields from this text:\n{document_text}\nReturn only JSON."

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=512,
    system=system_prompt,
    messages=[{"role": "user", "content": user_message}]
)

print("Extracted form fields:")
print(response.content[0].text)
```

Output:
```
Extracted form fields:
{
  "Name": "Jane Smith",
  "Email": "jane.smith@example.com",
  "Phone": "555-987-6543"
}
```

Troubleshooting
- If the model returns extra text besides JSON, refine your prompt to emphasize "Return only the JSON object."
- If fields are missing, increase max_tokens or split large documents into smaller chunks.
- For scanned PDFs, use OCR tools like Tesseract before extraction.
- Check your API key environment variable if authentication errors occur.
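The extra-text problem can also be handled in code: a small defensive parser (a sketch; parse_model_json is my own name) recovers the JSON object even when the model wraps it in a markdown fence or surrounding prose:

```python
import json
import re

def parse_model_json(reply):
    """Extract the first JSON object from an LLM reply.

    Handles replies wrapped in markdown fences or surrounded by prose.
    """
    # Strip a ```json ... ``` fence if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", reply, re.DOTALL)
    if fenced:
        reply = fenced.group(1)
    # Fall back to the outermost {...} span.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start:end + 1])

messy = 'Sure! Here are the fields:\n```json\n{"Name": "John Doe"}\n```'
print(parse_model_json(messy))  # {'Name': 'John Doe'}
```

Parsing with json.loads rather than using the raw string also catches malformed replies early, so you can retry the request instead of propagating bad data.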
Key Takeaways
- Use clear prompts instructing the LLM to output structured JSON for form fields.
- Preprocess scanned documents with OCR before passing text to the LLM.
- Adjust max_tokens and chunk size for large documents to avoid truncation.
- Anthropic and OpenAI both support form field extraction with similar prompt engineering.
- Always keep your API keys secure and set via environment variables.
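The chunking takeaway can be sketched with a simple splitter (chunk_text is my own helper name): each chunk is sent through the extraction prompt separately, and the per-chunk JSON objects are merged afterwards.

```python
def chunk_text(text, max_chars=4000, overlap=200):
    """Split a long document into overlapping chunks that each fit in a prompt."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap  # overlap avoids splitting a field across chunks
    return chunks

# Each chunk would go through the extraction prompt shown earlier,
# and the resulting dictionaries merged with dict.update().
print(chunk_text("Name: John Doe\n" * 500, max_chars=2000, overlap=100)[0][:14])
```

The overlap is a hedge: without it, a field whose label and value straddle a chunk boundary would be lost from both chunks.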