How to evaluate extraction accuracy

Quick answer
Evaluate extraction accuracy by comparing the AI-extracted data against a labeled ground truth dataset using metrics like precision, recall, and F1-score. Use Python libraries such as sklearn.metrics to compute these metrics programmatically for structured extraction tasks.
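
As a concrete illustration of these metrics, here is a minimal sketch that computes them directly from counts of true positives, false positives, and false negatives. The helper name `prf_from_counts` and the example counts are illustrative assumptions, not part of any library.

```python
def prf_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts, using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 correct values, 2 wrong values, 4 missed values
precision, recall, f1 = prf_from_counts(tp=8, fp=2, fn=4)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

In an extraction setting, a true positive is a correctly extracted value, a false positive is an extracted value that is wrong or should not exist, and a false negative is a labeled value the model failed to return.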

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" scikit-learn (quote the version specifier so the shell does not treat > as a redirect)

Setup

Install the required Python packages and set your environment variable for the OpenAI API key.

  • Install packages: pip install openai scikit-learn
  • Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
bash
pip install openai scikit-learn
output
Collecting openai
Collecting scikit-learn
Successfully installed openai-1.x scikit-learn-1.x

Step by step

This example shows how to extract structured data from text using OpenAI chat completions, then evaluate extraction accuracy against a labeled dataset using precision, recall, and F1-score from sklearn.metrics.

python
import os
from openai import OpenAI
from sklearn.metrics import precision_score, recall_score, f1_score

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample texts and ground truth extractions
texts = [
    "John's email is john@example.com and his phone is 123-456-7890.",
    "Contact Mary at mary@mail.com or 987-654-3210."
]
ground_truth = [
    {"email": "john@example.com", "phone": "123-456-7890"},
    {"email": "mary@mail.com", "phone": "987-654-3210"}
]

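# Extract email and phone using an OpenAI chat completion; returns {} if parsing fails

def extract_entities(text):
    import json
    messages = [
        {"role": "system", "content": "Extract the email and phone number. Respond with only a JSON object with the keys 'email' and 'phone'."},
        {"role": "user", "content": text}
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        response_format={"type": "json_object"},  # JSON mode: output is constrained to valid JSON
        temperature=0  # reduce run-to-run variation during evaluation
    )
    content = response.choices[0].message.content
    try:
        data = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or an empty response counts as "nothing extracted"
        return {}
    return data if isinstance(data, dict) else {}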

# Extracted results
extracted = [extract_entities(text) for text in texts]

# Prepare lists for evaluation
true_emails = [item["email"] for item in ground_truth]
pred_emails = [item.get("email", "") for item in extracted]

true_phones = [item["phone"] for item in ground_truth]
pred_phones = [item.get("phone", "") for item in extracted]

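# Turn each (truth, prediction) pair into binary labels for sklearn.
# Labeling every item as positive would make precision trivially 1.0, so instead:
#   correct value -> true positive
#   wrong value   -> one false positive (the wrong value) plus one false negative (the missed truth)
#   missing value -> false negative

def to_binary_labels(true_list, pred_list):
    y_true, y_pred = [], []
    for t, p in zip(true_list, pred_list):
        if p == t:
            y_true.append(1)
            y_pred.append(1)
            continue
        if p:  # the model returned a value, but it is wrong
            y_true.append(0)
            y_pred.append(1)
        # the true value was not recovered
        y_true.append(1)
        y_pred.append(0)
    return y_true, y_pred

email_true, email_pred = to_binary_labels(true_emails, pred_emails)
phone_true, phone_pred = to_binary_labels(true_phones, pred_phones)

# Compute precision, recall, F1 (zero_division=0 returns 0 instead of warning when nothing is predicted)
email_precision = precision_score(email_true, email_pred, zero_division=0)
email_recall = recall_score(email_true, email_pred, zero_division=0)
email_f1 = f1_score(email_true, email_pred, zero_division=0)

phone_precision = precision_score(phone_true, phone_pred, zero_division=0)
phone_recall = recall_score(phone_true, phone_pred, zero_division=0)
phone_f1 = f1_score(phone_true, phone_pred, zero_division=0)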

print(f"Email extraction - Precision: {email_precision:.2f}, Recall: {email_recall:.2f}, F1: {email_f1:.2f}")
print(f"Phone extraction - Precision: {phone_precision:.2f}, Recall: {phone_recall:.2f}, F1: {phone_f1:.2f}")
output
Email extraction - Precision: 1.00, Recall: 1.00, F1: 1.00
Phone extraction - Precision: 1.00, Recall: 1.00, F1: 1.00
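
To see how the scores behave when the model makes mistakes, the following sketch counts outcomes per field and then micro-averages across fields. It uses only the standard library, and the `predicted` list is hand-written hypothetical data rather than real model output; the helper names `count_outcomes` and `f1_from` are likewise illustrative.

```python
from collections import Counter

# Ground truth from the example above, plus hypothetical predictions (no API call)
ground_truth = [
    {"email": "john@example.com", "phone": "123-456-7890"},
    {"email": "mary@mail.com", "phone": "987-654-3210"},
]
predicted = [
    {"email": "john@example.com", "phone": "123-456-7890"},  # both fields correct
    {"email": "mary@mail.org"},                               # wrong email, missing phone
]

def count_outcomes(truth, preds, fields):
    """Count tp/fp/fn per field; a wrong value is both a false positive and a false negative."""
    counts = {field: Counter() for field in fields}
    for t, p in zip(truth, preds):
        for field in fields:
            true_val, pred_val = t.get(field), p.get(field)
            if pred_val and pred_val == true_val:
                counts[field]["tp"] += 1
            else:
                if pred_val:
                    counts[field]["fp"] += 1
                if true_val:
                    counts[field]["fn"] += 1
    return counts

def f1_from(c):
    """F1 from a Counter with tp/fp/fn keys; 0.0 when there is nothing to score."""
    denom = 2 * c["tp"] + c["fp"] + c["fn"]
    return 2 * c["tp"] / denom if denom else 0.0

counts = count_outcomes(ground_truth, predicted, ["email", "phone"])
micro = sum(counts.values(), Counter())  # pool counts across fields
for field, c in counts.items():
    print(f"{field}: tp={c['tp']} fp={c['fp']} fn={c['fn']} F1={f1_from(c):.2f}")
print(f"micro-averaged F1: {f1_from(micro):.2f}")
```

Per-field scores show which field the model struggles with, while the micro-average gives a single number weighted by how many values each field contributes.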

Common variations

You can run the extraction calls asynchronously using async Python with the OpenAI SDK, which helps when the labeled dataset is large. You can also swap models: gpt-4o typically gives higher accuracy, while gpt-4o-mini is faster and cheaper. For streaming extraction, a JSON response is only parseable once it is complete, so accumulate the streamed chunks into the full response before parsing and evaluating.

python
import asyncio
import json
import os
from openai import AsyncOpenAI

# AsyncOpenAI exposes the same methods as OpenAI, but they are awaitable
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_extract_entities(text):
    messages = [
        {"role": "system", "content": "Extract email and phone number as JSON."},
        {"role": "user", "content": text}
    ]
    # openai>=1.0 has no acreate(); await create() on the async client instead
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format={"type": "json_object"}
    )
    content = response.choices[0].message.content
    return json.loads(content)

async def main():
    text = "Reach out to Alice at alice@example.com or 555-1234."
    result = await async_extract_entities(text)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
output
{'email': 'alice@example.com', 'phone': '555-1234'}
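
When evaluating many texts, the async calls can be issued concurrently with a cap on how many are in flight. The sketch below shows one way to do that with `asyncio.gather` and a semaphore. To stay runnable without an API key it uses `fake_extract`, a hypothetical regex-based stand-in for the real async extraction call; `extract_all` and `max_concurrency` are also illustrative names.

```python
import asyncio
import re

async def fake_extract(text):
    """Hypothetical stand-in for an async extraction call; swap in a real API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    return {"email": match.group(0) if match else ""}

async def extract_all(texts, extract, max_concurrency=5):
    """Run `extract` over all texts concurrently, returning results in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text):
        async with semaphore:  # at most max_concurrency calls run at once
            return await extract(text)

    return await asyncio.gather(*(bounded(t) for t in texts))

texts = [
    "Reach out to Alice at alice@example.com or 555-1234.",
    "Contact Mary at mary@mail.com or 987-654-3210.",
]
results = asyncio.run(extract_all(texts, fake_extract, max_concurrency=2))
print(results)
```

Because `asyncio.gather` returns results in the order the awaitables were passed, the output list stays aligned with the ground truth list, which is what the metric computation relies on.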

Troubleshooting

  • If extraction results are empty or malformed JSON, verify the prompt instructs the model to respond strictly in JSON format.
  • If metrics are unexpectedly low, check that ground truth and predictions are aligned and normalized (e.g., trimming whitespace, consistent casing).
  • For API errors, ensure your OPENAI_API_KEY is set correctly and the model name is current.
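
The first two points can be sketched as small helpers: one that parses a model response leniently, and two that normalize values before comparison. The function names are illustrative, and the normalization rules (lowercasing emails, keeping only digits in phone numbers) are assumptions that may not suit every dataset.

```python
import json
import re

def parse_json_lenient(content):
    """Parse the first {...} span in a response, tolerating surrounding text or code fences."""
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end < start:
        return {}
    try:
        data = json.loads(content[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}

def normalize_email(value):
    """Trim whitespace and lowercase; assumes case-insensitive matching is acceptable."""
    return (value or "").strip().lower()

def normalize_phone(value):
    """Keep digits only so '123-456-7890' and '(123) 456-7890' compare equal."""
    return re.sub(r"\D", "", value or "")

parsed = parse_json_lenient('Here is the result: {"email": " John@Example.com ", "phone": "(123) 456-7890"}')
print(normalize_email(parsed.get("email")), normalize_phone(parsed.get("phone")))
```

Apply the same normalization to both the ground truth and the predictions; normalizing only one side introduces mismatches instead of removing them.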

Key Takeaways

  • Use labeled ground truth data to compute precision, recall, and F1-score for extraction accuracy.
  • Leverage sklearn.metrics for standardized evaluation metrics in Python.
  • Normalize extracted and true values before comparison to avoid false mismatches.
  • Test with different models and async calls to optimize accuracy and performance.
  • Ensure your prompt enforces strict JSON output for reliable parsing.
Verified 2026-04 · gpt-4o-mini, gpt-4o