Fix LLM extracting wrong fields
Quick answer
To stop a large language model (LLM) from extracting wrong fields, use structured prompts with explicit instructions and enforce a response schema, for example a pydantic model passed as response_format, or strict JSON schema validation. Then parse and validate the output strictly with the OpenAI Python SDK to ensure correct field extraction.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.40" (structured-output parsing requires a recent SDK)
- pydantic (optional, for structured extraction)
Setup
Install the openai Python SDK and set your API key as an environment variable.
- Install SDK:
pip install openai
- Set environment variable:
export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
pip install openai output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
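Before making any API calls, you can confirm the key is actually visible to Python with a quick stdlib-only check (no network request, no SDK import needed):

```python
import os

# Verify the API key is exported before any SDK calls are attempted.
key = os.environ.get("OPENAI_API_KEY")
if key:
    print(f"Key found (ends in ...{key[-4:]})")
else:
    print("OPENAI_API_KEY is not set; export it as shown above")
```

This catches the common case where the key was set in a different shell session than the one running your script.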
Step by step
Use an explicit prompt and a pydantic model passed as response_format, so the SDK's parse helper validates the output against the schema and returns typed fields. (response_model is the equivalent parameter in the third-party instructor library; the OpenAI SDK itself uses response_format.)
import os
from openai import OpenAI
from pydantic import BaseModel
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
class UserData(BaseModel):
    name: str
    age: int
prompt = "Extract the user's name and age from the text: 'John is 30 years old.'"
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format=UserData,
)
user_data = response.choices[0].message.parsed
print(f"Name: {user_data.name}, Age: {user_data.age}")
output
Name: John, Age: 30
Common variations
You can also fall back to raw JSON parsing if you are not using pydantic, or switch to another model, such as claude-3-5-sonnet-20241022 with the Anthropic SDK, for extraction.
import json
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
prompt = (
    "Extract the user's name and age as JSON from the text: 'John is 30 years old.'\n"
    'Respond only with JSON like {"name": "", "age": 0}'
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # JSON mode: guarantees syntactically valid JSON
)
try:
    data = json.loads(response.choices[0].message.content)
    print(f"Name: {data['name']}, Age: {data['age']}")
except (json.JSONDecodeError, KeyError) as e:
    print(f"Failed to parse JSON or missing fields: {e}")
output
Name: John, Age: 30
Troubleshooting
- If fields are missing or incorrect, refine your prompt to be more explicit and include examples.
- Use response_format (or strict JSON schema validation) to catch errors early.
- Check for trailing text or formatting issues in the LLM output.
- Test with different models, such as gpt-4o-mini or claude-3-5-sonnet-20241022, for better extraction accuracy.
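The "trailing text or formatting issues" point can be handled in code rather than by hand. A minimal sketch of a defensive parser (the helper name parse_llm_json and the required field set are illustrative, not part of any SDK):

```python
import json

def parse_llm_json(raw: str) -> dict:
    """Strip common formatting issues (markdown fences) before
    parsing, then check that the required fields are present."""
    text = raw.strip()
    # Remove a ```json ... ``` wrapper if the model added one.
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    data = json.loads(text)
    missing = {"name", "age"} - data.keys()
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return data

# A response wrapped in a markdown fence still parses cleanly:
raw = '```json\n{"name": "John", "age": 30}\n```'
print(parse_llm_json(raw))  # → {'name': 'John', 'age': 30}
```

Raising on missing fields (rather than silently returning partial data) makes extraction failures visible immediately, which is what "catch errors early" means in practice.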
Key Takeaways
- Use explicit prompts with clear instructions to improve field extraction accuracy.
- Leverage response_format with pydantic for structured output validation.
- Parse and validate JSON output strictly to catch extraction errors early.
- Test multiple models to find the best extractor for your use case.
- Refine prompts iteratively based on extraction errors and missing fields.
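The iterative-refinement takeaway can be sketched as a retry loop that feeds each validation error back to the model. Here call_model is a hypothetical stand-in for whatever function wraps your real API call; the loop itself is the pattern:

```python
import json

def extract_with_retry(call_model, prompt, required=("name", "age"), max_tries=3):
    """Retry extraction, appending each validation error to the
    prompt so the model can correct itself on the next attempt."""
    for _ in range(max_tries):
        raw = call_model(prompt)  # hypothetical: wraps your real API call
        try:
            data = json.loads(raw)
            missing = [f for f in required if f not in data]
            if not missing:
                return data
            error = f"missing fields {missing}"
        except json.JSONDecodeError as exc:
            error = f"invalid JSON ({exc})"
        # Refine the prompt with the specific error for the next attempt.
        prompt += f"\nYour previous answer had {error}. Respond with valid JSON only."
    raise ValueError(f"extraction failed after {max_tries} attempts")

# Demo with a fake model that fails once, then returns valid JSON.
answers = iter(["not json", '{"name": "John", "age": 30}'])
print(extract_with_retry(lambda p: next(answers), "Extract name and age."))
# → {'name': 'John', 'age': 30}
```

Because the error message is specific ("missing fields" vs. "invalid JSON"), the model gets actionable feedback instead of a generic retry, which is what makes iterative prompt refinement converge.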