
How to evaluate extraction accuracy

Quick answer

Evaluate extraction accuracy by comparing the AI-extracted data against a labeled ground truth dataset using metrics like precision, recall, and F1-score. Use Python libraries such as sklearn.metrics to compute these metrics programmatically for structured extraction tasks.

PREREQUISITES

Python 3.8+
OpenAI API key (free tier works)
pip install "openai>=1.0" scikit-learn (quote the version specifier so the shell does not treat > as a redirect)

Setup

Install the required Python packages and set your environment variable for the OpenAI API key.

Install packages: pip install openai scikit-learn
Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows; setx only affects terminals opened afterwards, so start a new one)

bash

pip install openai scikit-learn

output

Collecting openai
Collecting scikit-learn
Successfully installed openai-1.x scikit-learn-1.x

Step by step

This example shows how to extract structured data from text using OpenAI chat completions, then evaluate extraction accuracy against a labeled dataset using precision, recall, and F1-score from sklearn.metrics.

python

import json
import os

from openai import OpenAI
from sklearn.metrics import precision_score, recall_score, f1_score

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample texts and ground truth extractions
texts = [
    "John's email is john@example.com and his phone is 123-456-7890.",
    "Contact Mary at mary@mail.com or 987-654-3210."
]
ground_truth = [
    {"email": "john@example.com", "phone": "123-456-7890"},
    {"email": "mary@mail.com", "phone": "987-654-3210"}
]

# Extract email and phone with a chat completion.
# Returns {} when the reply is not a valid JSON object.

def extract_entities(text):
    messages = [
        # Name the expected keys so predictions line up with the ground truth
        {"role": "system", "content": 'Extract the email and phone number. Respond with a JSON object with exactly the keys "email" and "phone".'},
        {"role": "user", "content": text}
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        response_format={"type": "json_object"},  # JSON mode: the reply is a single JSON object
        temperature=0  # reduce run-to-run variation during evaluation
    )
    content = response.choices[0].message.content
    try:
        data = json.loads(content)
    except (json.JSONDecodeError, TypeError):  # TypeError covers content being None
        return {}
    return data if isinstance(data, dict) else {}

# Extracted results
extracted = [extract_entities(text) for text in texts]

# Prepare lists for evaluation
true_emails = [item["email"] for item in ground_truth]
pred_emails = [item.get("email", "") for item in extracted]

true_phones = [item["phone"] for item in ground_truth]
pred_phones = [item.get("phone", "") for item in extracted]

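# Define simple binary match function: 1 for an exact match, 0 otherwise

def binary_match(true_list, pred_list):
    return [1 if t == p else 0 for t, p in zip(true_list, pred_list)]

# Every ground-truth record contains both fields, so all true labels are 1.
# With this encoding, recall equals the exact-match rate. Precision is 1.0
# whenever at least one prediction matches, because there are no negative
# labels that could produce a false positive.
email_true = [1]*len(true_emails)  # all positives
email_pred = binary_match(true_emails, pred_emails)

phone_true = [1]*len(true_phones)  # all positives
phone_pred = binary_match(true_phones, pred_phones)

# Compute precision, recall, F1
# zero_division=0 returns 0.0 (without a warning) when nothing matched
email_precision = precision_score(email_true, email_pred, zero_division=0)
email_recall = recall_score(email_true, email_pred, zero_division=0)
email_f1 = f1_score(email_true, email_pred, zero_division=0)

phone_precision = precision_score(phone_true, phone_pred, zero_division=0)
phone_recall = recall_score(phone_true, phone_pred, zero_division=0)
phone_f1 = f1_score(phone_true, phone_pred, zero_division=0)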

print(f"Email extraction - Precision: {email_precision:.2f}, Recall: {email_recall:.2f}, F1: {email_f1:.2f}")
print(f"Phone extraction - Precision: {phone_precision:.2f}, Recall: {phone_recall:.2f}, F1: {phone_f1:.2f}")

output

Email extraction - Precision: 1.00, Recall: 1.00, F1: 1.00
Phone extraction - Precision: 1.00, Recall: 1.00, F1: 1.00
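
Because every ground-truth label in the example is positive, those scores cannot tell a wrong value apart from a missing one. The sketch below is one possible alternative, not the method used above. It counts true positives, false positives, and false negatives per field in plain Python instead of sklearn. The helper name score_field and the sample values are assumptions made for illustration, not real model output.

```python
from collections import Counter

def score_field(truths, preds):
    """Count TP/FP/FN for one field and derive precision, recall, and F1.

    TP: the prediction is non-empty and equals the truth.
    FP: the prediction is non-empty but does not equal the truth.
    FN: the truth is non-empty but was not recovered.
    """
    counts = Counter()
    for truth, pred in zip(truths, preds):
        if pred and pred == truth:
            counts["tp"] += 1
        else:
            if pred:
                counts["fp"] += 1
            if truth:
                counts["fn"] += 1
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative data: one correct, one wrong, one missing, one correctly empty.
truths = ["john@example.com", "mary@mail.com", "alice@example.com", ""]
preds = ["john@example.com", "mary@mail.org", "", ""]

scores = score_field(truths, preds)
print(f"P={scores['precision']:.2f} R={scores['recall']:.2f} F1={scores['f1']:.2f}")
# P=0.50 R=0.33 F1=0.40
```

A wrong value counts as both a false positive and a false negative, while a missing value lowers recall only, so the two failure modes show up differently in the scores.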

Common variations

You can evaluate extraction accuracy asynchronously using async Python with the OpenAI SDK. Also, try different models like gpt-4o for higher accuracy or gpt-4o-mini for faster, cheaper runs. For streaming extraction, process tokens as they arrive but aggregate results before evaluation.

python

import asyncio
import json
import os
from openai import AsyncOpenAI

async def async_extract_entities(text):
    # AsyncOpenAI mirrors the OpenAI client, but its methods are awaitable
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [
        {"role": "system", "content": "Extract email and phone number as JSON."},
        {"role": "user", "content": text}
    ]
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format={"type": "json_object"}  # JSON mode: the reply is a single JSON object
    )
    content = response.choices[0].message.content
    return json.loads(content)

async def main():
    text = "Reach out to Alice at alice@example.com or 555-1234."
    result = await async_extract_entities(text)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

output

{'email': 'alice@example.com', 'phone': '555-1234'}
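
For the streaming variation, the main point is that partial JSON cannot be parsed or scored, so the text deltas are joined first and parsed once the stream ends. The following is a minimal sketch under that assumption. The helper names aggregate_stream, stream_extract, and fake_chunk are hypothetical, and the hand-made chunks only imitate the shape of the SDK's stream events so the example runs without an API call.

```python
import json
from types import SimpleNamespace

def aggregate_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices at all
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks have no text
            parts.append(delta)
    return "".join(parts)

def stream_extract(client, text, model="gpt-4o-mini"):
    """Stream an extraction and parse the JSON only after the stream completes.

    `client` is expected to be an openai.OpenAI instance.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Extract email and phone number as JSON."},
            {"role": "user", "content": text},
        ],
        stream=True,
    )
    return json.loads(aggregate_stream(stream))

# Offline demonstration with stand-in chunks (not real API output).
def fake_chunk(text):
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

fake_stream = [
    fake_chunk(None),
    fake_chunk('{"email": "alice@'),
    fake_chunk('example.com"}'),
    SimpleNamespace(choices=[]),
]
result = json.loads(aggregate_stream(fake_stream))
print(result)  # {'email': 'alice@example.com'}
```

The parsed result can then be scored against the ground truth in the same way as a non-streamed response.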

Troubleshooting

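If extraction returns empty results or malformed JSON, check that the prompt tells the model to respond with JSON only and names the expected keys. A reply wrapped in a Markdown code fence or surrounded by prose will fail json.loads.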
If metrics are unexpectedly low, check that ground truth and predictions are aligned and normalized (e.g., trimming whitespace, consistent casing).
For API errors, ensure your OPENAI_API_KEY is set correctly and the model name is current.
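
The normalization tip can be sketched as follows. The helpers normalize_email and normalize_phone are hypothetical, and the rules they apply are assumptions to adapt to your data: emails are trimmed and lowercased, and phones are reduced to digits. Reducing to digits also drops a leading "+", so numbers with and without a country code will still differ.

```python
import re

def normalize_email(value):
    """Trim whitespace and lowercase; None is treated as an empty string."""
    return (value or "").strip().lower()

def normalize_phone(value):
    """Keep digits only so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", value or "")

truth = {"email": "john@example.com", "phone": "123-456-7890"}
pred = {"email": " John@Example.com ", "phone": "(123) 456-7890"}

# Raw comparison fails on formatting alone.
raw_match = pred["email"] == truth["email"] and pred["phone"] == truth["phone"]

# Normalizing both sides leaves only genuine differences.
normalized_match = (
    normalize_email(pred["email"]) == normalize_email(truth["email"])
    and normalize_phone(pred["phone"]) == normalize_phone(truth["phone"])
)
print(raw_match, normalized_match)  # False True
```

Apply the same functions to both the ground truth and the predictions before building the match lists, so that both sides go through identical rules.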


Key Takeaways

Use labeled ground truth data to compute precision, recall, and F1-score for extraction accuracy.
Leverage sklearn.metrics for standardized evaluation metrics in Python.
Normalize extracted and true values before comparison to avoid false mismatches.
Test with different models and async calls to optimize accuracy and performance.
Ensure your prompt enforces strict JSON output for reliable parsing.

Verified 2026-04 · gpt-4o-mini, gpt-4o
