How-to · Beginner · 3 min read

Handle extraction from noisy text

Quick answer
Call chat.completions.create with a prompt that explicitly instructs the model to extract structured data (e.g. JSON with named keys) from the noisy input. Models such as gpt-4o handle misspellings and digit-for-letter substitutions well; parse the returned string to recover the structured fields.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key
  • pip install "openai>=1.0" (quoted so the shell does not treat > as output redirection)

Setup

Install the official openai Python SDK and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows).
bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (50 kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

Use the gpt-4o model with a prompt that instructs the model to extract structured information from noisy text. The example below extracts a name and age from a noisy input string.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

noisy_text = "J0hn D0e is ab0ut 3O years old, livin in NY."

prompt = f"Extract the person's name and age from this noisy text:\n\n{noisy_text}\n\nRespond in JSON with keys 'name' and 'age'."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

extracted_text = response.choices[0].message.content
print("Extracted data:", extracted_text)
output
Extracted data: {
  "name": "John Doe",
  "age": 30
}
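The model returns a plain string, so to work with the fields programmatically you still need to parse it. A minimal sketch using the standard json module; the fence-stripping step is a defensive assumption for the case where the model wraps its reply in a markdown code block:

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse the model's reply into a dict, tolerating markdown code fences."""
    cleaned = raw.strip()
    # Models sometimes wrap JSON in ```json ... ``` fences; strip them if present.
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
    return json.loads(cleaned)

raw_reply = '{\n  "name": "John Doe",\n  "age": 30\n}'
data = parse_extraction(raw_reply)
print(data["name"], data["age"])  # John Doe 30
```

In the example above you would call parse_extraction(extracted_text) instead of the hard-coded sample string.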

Common variations

You can use asynchronous calls for better performance in concurrent environments, or switch to other models like gpt-4o-mini for cost efficiency. Also, consider adding system instructions to improve extraction accuracy.

python
import os
import asyncio
from openai import AsyncOpenAI  # async calls require the AsyncOpenAI client

async def async_extract():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    noisy_text = "J0hn D0e is ab0ut 3O years old, livin in NY."
    prompt = f"Extract the person's name and age from this noisy text:\n\n{noisy_text}\n\nRespond in JSON with keys 'name' and 'age'."

    # AsyncOpenAI mirrors the sync interface, but create() is awaitable.
    # (There is no acreate() method in openai>=1.0.)
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    print("Extracted data (async):", response.choices[0].message.content)

asyncio.run(async_extract())
output
Extracted data (async): {
  "name": "John Doe",
  "age": 30
}
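The system-instruction variation mentioned above can be sketched as follows; the exact wording of the system message is an illustrative choice, not a fixed recipe:

```python
import os
from openai import OpenAI

# An illustrative system message pinning down format and error tolerance up front.
system_msg = (
    "You are a data-extraction assistant. Inputs may contain typos and "
    "digit/letter substitutions (0 for o, 3 for e). Respond with JSON only, "
    "using keys 'name' and 'age'."
)
messages = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": "J0hn D0e is ab0ut 3O years old, livin in NY."},
]

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```

Because the format instructions live in the system message, the user message can now carry only the noisy text itself.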

Troubleshooting

  • If the model returns incomplete or malformed JSON, instruct it to respond with JSON only; with gpt-4o and gpt-4o-mini you can also pass response_format={"type": "json_object"} to chat.completions.create to force valid JSON (the prompt must mention JSON for this mode to work).
  • If noisy text is too corrupted, consider preprocessing with regex or spell correction before extraction.
  • Check your API key environment variable if authentication errors occur.
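For the preprocessing suggestion above, a simple digit-to-letter normalization pass is one possible sketch. The substitution table here is an assumption for illustration, not an exhaustive mapping:

```python
import re

# Common digit-for-letter substitutions seen in noisy text (illustrative, not exhaustive).
SUBS = {"0": "o", "1": "l", "3": "e", "5": "s", "7": "t"}

def denoise(text: str) -> str:
    """Replace digits with look-alike letters when they appear inside words."""
    def fix_word(match: re.Match) -> str:
        word = match.group(0)
        # Only rewrite tokens that mix letters and digits; pure numbers stay numbers.
        if any(c.isalpha() for c in word) and any(c.isdigit() for c in word):
            return "".join(SUBS.get(c, c) for c in word)
        return word
    return re.sub(r"\w+", fix_word, text)

print(denoise("J0hn D0e is livin in NY."))  # John Doe is livin in NY.
```

Note this heuristic cannot repair every case (e.g. "3O" meaning 30 mixes a digit with a look-alike letter in the other direction), which is exactly where the model's own noise tolerance still helps.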

Key Takeaways

  • Use clear, explicit prompts instructing JSON output for reliable extraction from noisy text.
  • Leverage gpt-4o or gpt-4o-mini models depending on accuracy and cost needs.
  • Async API calls improve throughput in concurrent applications.
  • Preprocessing noisy text can improve extraction quality when noise is extreme.
  • Always set your API key via environment variables to avoid authentication issues.
Verified 2026-04 · gpt-4o, gpt-4o-mini