How to reduce structured output latency
Quick answer
To reduce structured output latency, use streaming mode to receive partial outputs as they are generated, and parse the JSON incrementally client-side, validating it against your schema. Also, trim prompts to minimize token usage and choose a faster model such as gpt-4o-mini when appropriate.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the official openai Python SDK and set your API key as an environment variable.
pip install "openai>=1.0"

Step by step
This example reduces latency by streaming a structured JSON response from gpt-4o and parsing it incrementally as the partial chunks arrive.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# JSON schema used to validate the structured output client-side
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string", "format": "email"},
    },
    "required": ["name", "age", "email"],
}

# Prompt requesting structured JSON output (JSON mode requires the
# word "JSON" to appear somewhere in the prompt)
messages = [
    {"role": "user", "content": "Provide user info as JSON with fields name, age, and email."}
]

# Streaming call in JSON mode to reduce perceived latency
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
    response_format={"type": "json_object"},
)

print("Streaming structured output:")
collected = ""
for chunk in response:
    delta = chunk.choices[0].delta
    if delta.content:  # deltas are objects in openai>=1.0, not dicts
        print(delta.content, end="", flush=True)
        collected += delta.content

print("\n\nFull collected output:")
print(collected)

Output
Streaming structured output:
{
"name": "Alice",
"age": 30,
"email": "alice@example.com"
}
Full collected output:
{
"name": "Alice",
"age": 30,
"email": "alice@example.com"
}

Common variations
You can reduce latency further by:

- Using smaller models like gpt-4o-mini for faster responses.
- Pre-validating prompt templates to minimize token usage.
- Using async calls with asyncio for concurrent requests.
- Applying client-side JSON streaming parsers to process partial data immediately.
import asyncio
import os
from openai import AsyncOpenAI

# Async usage requires AsyncOpenAI, not the sync OpenAI client
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def stream_structured_output():
    # In openai>=1.0 the async client uses the same create() method;
    # the old acreate() helper no longer exists
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Send JSON with name and age."}],
        stream=True,
        response_format={"type": "json_object"},
    )
    collected = ""
    async for chunk in response:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
            collected += delta.content
    print("\nFull output:", collected)

asyncio.run(stream_structured_output())

Output
Streaming JSON partials printed live, then full output printed after stream ends.
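The client-side streaming-parser variation can be sketched with a minimal attempt-parse loop: accumulate the streamed text and retry a full parse after each chunk, succeeding as soon as the object is complete. The chunks below are simulated stand-ins for streamed deltas, and parse_stream is a hypothetical helper; a dedicated incremental parser (e.g. ijson) can surface individual fields even earlier.

```python
import json

def parse_stream(chunks):
    """Attempt to parse accumulated JSON after each streamed chunk.

    Returns the first successfully parsed value. Because the payload
    here is a JSON object, the parse only succeeds once the closing
    brace arrives, so there is no risk of a premature partial match.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            return json.loads(buffer)  # succeeds once the object is complete
        except json.JSONDecodeError:
            continue  # JSON still incomplete; keep accumulating
    raise ValueError("stream ended before a complete JSON object arrived")

# Simulated stream of partial JSON text
chunks = ['{"name": "Al', 'ice", "age": 3', '0}']
print(parse_stream(chunks))  # {'name': 'Alice', 'age': 30}
```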
Troubleshooting
- If streaming output is incomplete or malformed, verify your JSON schema matches expected output.
- High latency may be due to large token usage; simplify prompts or switch to smaller models.
- Network issues can cause stream interruptions; implement retry logic.
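The retry advice above can be sketched as a small backoff wrapper. stream_with_retry and flaky are hypothetical names; in practice make_stream would wrap the streaming loop from the main example and catch the SDK's connection errors rather than the built-in ConnectionError used here for illustration.

```python
import random
import time

def stream_with_retry(make_stream, max_retries=3, base_delay=1.0):
    """Retry a streaming call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return make_stream()  # zero-arg callable returning the collected text
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Simulate a stream that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("stream interrupted")
    return '{"name": "Alice"}'

print(stream_with_retry(flaky, base_delay=0.01))  # {"name": "Alice"}
```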
Key Takeaways
- Use streaming mode to receive partial structured outputs immediately, reducing perceived latency.
- Validate and parse JSON incrementally client-side to handle structured data efficiently.
- Choose smaller, faster models like gpt-4o-mini when latency is critical.
- Optimize prompt length and complexity to minimize token consumption and speed up responses.