How-to · Beginner · 3 min read

Handle extraction from noisy text

Quick answer
Call chat.completions.create with a prompt that explicitly instructs the model to extract structured data (e.g. JSON with named keys) from the noisy input. Models such as gpt-4o handle misspellings and digit-for-letter substitutions well; parse the returned string to recover the structured fields.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key
  • pip install "openai>=1.0" (quoted so the shell does not treat > as output redirection)

Setup

Install the official openai Python SDK and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows).
bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (50 kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

Use the gpt-4o model with a prompt that instructs the model to extract structured information from noisy text. The example below extracts a name and age from a noisy input string.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

noisy_text = "J0hn D0e is ab0ut 3O years old, livin in NY."

prompt = f"Extract the person's name and age from this noisy text:\n\n{noisy_text}\n\nRespond in JSON with keys 'name' and 'age'."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

extracted_text = response.choices[0].message.content
print("Extracted data:", extracted_text)
output
Extracted data: {
  "name": "John Doe",
  "age": 30
}
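The model returns a plain string, so to work with the fields programmatically you still need to parse it. A minimal sketch using the standard json module; the fence-stripping step is a defensive assumption for the case where the model wraps its reply in a markdown code block:

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse the model's reply into a dict, tolerating markdown code fences."""
    cleaned = raw.strip()
    # Models sometimes wrap JSON in ```json ... ``` fences; strip them if present.
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
    return json.loads(cleaned)

raw_reply = '{\n  "name": "John Doe",\n  "age": 30\n}'
data = parse_extraction(raw_reply)
print(data["name"], data["age"])  # John Doe 30
```

In the example above you would call parse_extraction(extracted_text) instead of the hard-coded sample string.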

Common variations

You can use asynchronous calls for better performance in concurrent environments, or switch to other models like gpt-4o-mini for cost efficiency. Also, consider adding system instructions to improve extraction accuracy.

python
import os
import asyncio
from openai import AsyncOpenAI  # async calls require the AsyncOpenAI client

async def async_extract():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    noisy_text = "J0hn D0e is ab0ut 3O years old, livin in NY."
    prompt = f"Extract the person's name and age from this noisy text:\n\n{noisy_text}\n\nRespond in JSON with keys 'name' and 'age'."

    # AsyncOpenAI mirrors the sync interface, but create() is awaitable.
    # (There is no acreate() method in openai>=1.0.)
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    print("Extracted data (async):", response.choices[0].message.content)

asyncio.run(async_extract())
output
Extracted data (async): {
  "name": "John Doe",
  "age": 30
}
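The system-instruction variation mentioned above can be sketched as follows; the exact wording of the system message is an illustrative choice, not a fixed recipe:

```python
import os
from openai import OpenAI

# An illustrative system message pinning down format and error tolerance up front.
system_msg = (
    "You are a data-extraction assistant. Inputs may contain typos and "
    "digit/letter substitutions (0 for o, 3 for e). Respond with JSON only, "
    "using keys 'name' and 'age'."
)
messages = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": "J0hn D0e is ab0ut 3O years old, livin in NY."},
]

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```

Because the format instructions live in the system message, the user message can now carry only the noisy text itself.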

Troubleshooting

  • If the model returns incomplete or malformed JSON, instruct it to respond with JSON only; with gpt-4o and gpt-4o-mini you can also pass response_format={"type": "json_object"} to chat.completions.create to force valid JSON (the prompt must mention JSON for this mode to work).
  • If noisy text is too corrupted, consider preprocessing with regex or spell correction before extraction.
  • Check your API key environment variable if authentication errors occur.
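For the preprocessing suggestion above, a simple digit-to-letter normalization pass is one possible sketch. The substitution table here is an assumption for illustration, not an exhaustive mapping:

```python
import re

# Common digit-for-letter substitutions seen in noisy text (illustrative, not exhaustive).
SUBS = {"0": "o", "1": "l", "3": "e", "5": "s", "7": "t"}

def denoise(text: str) -> str:
    """Replace digits with look-alike letters when they appear inside words."""
    def fix_word(match: re.Match) -> str:
        word = match.group(0)
        # Only rewrite tokens that mix letters and digits; pure numbers stay numbers.
        if any(c.isalpha() for c in word) and any(c.isdigit() for c in word):
            return "".join(SUBS.get(c, c) for c in word)
        return word
    return re.sub(r"\w+", fix_word, text)

print(denoise("J0hn D0e is livin in NY."))  # John Doe is livin in NY.
```

Note this heuristic cannot repair every case (e.g. "3O" meaning 30 mixes a digit with a look-alike letter in the other direction), which is exactly where the model's own noise tolerance still helps.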

Key Takeaways

  • Use clear, explicit prompts instructing JSON output for reliable extraction from noisy text.
  • Leverage gpt-4o or gpt-4o-mini models depending on accuracy and cost needs.
  • Async API calls improve throughput in concurrent applications.
  • Preprocessing noisy text can improve extraction quality when noise is extreme.
  • Always set your API key via environment variables to avoid authentication issues.
Verified 2026-04 · gpt-4o, gpt-4o-mini