How-to · Intermediate · 3 min read

Instructor streaming partial models explained

Quick answer
Use Instructor's partial streaming: call chat.completions.create_partial on an Instructor-wrapped client (or pass response_model=instructor.Partial[YourModel] together with stream=True). Instead of raw tokens, the stream yields partially populated Pydantic model instances as fields are filled in, so you can process structured output in real time. Note that "partial" refers to the incrementally completed response model, not to the LLM itself; any streaming-capable model such as gpt-4o-mini works.

PREREQUISITES

  • Python 3.9+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0
  • pip install instructor>=1.0 (adds create_partial)

Setup

Install the required packages and set your OpenAI API key in the environment. Instructor works as a wrapper around OpenAI's SDK for structured response extraction.

bash
pip install openai instructor
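Partial streaming lives in newer Instructor releases, so it is worth confirming which versions the install actually resolved. A stdlib-only check (records None for anything missing):

```python
from importlib.metadata import PackageNotFoundError, version

# Record the installed version of each package, or None if absent
installed = {}
for pkg in ("openai", "instructor"):
    try:
        installed[pkg] = version(pkg)
    except PackageNotFoundError:
        installed[pkg] = None

print(installed)
```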

Step by step

This example streams partial UserInfo objects from an Instructor client backed by gpt-4o-mini. create_partial yields a new, progressively more complete model instance each time the token stream fills in a field, which you can process in real time.

python
import os
from openai import OpenAI
import instructor

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Wrap OpenAI client with Instructor
instructor_client = instructor.from_openai(client)

# Define a simple Pydantic model for structured extraction
from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int

# Prepare messages
messages = [{"role": "user", "content": "Extract name and age from: 'Alice is 28 years old.'"}]

# Stream partially populated UserInfo objects as the response arrives
stream = instructor_client.chat.completions.create_partial(
    model="gpt-4o-mini",
    messages=messages,
    response_model=UserInfo,
)

# Each iteration yields the current partial extraction state
for partial_user in stream:
    print(partial_user)

print("Streaming complete.")
output (representative; intermediate states vary)
name='Alice' age=None
name='Alice' age=28
Streaming complete.
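Conceptually, partial streaming works by re-parsing the growing JSON buffer after each chunk and validating whatever fields have closed so far. A stdlib-only sketch of that idea (parse_partial is a hypothetical helper, far cruder than Instructor's real parser):

```python
import json

def parse_partial(buffer: str) -> dict:
    """Best-effort parse: close any open string and braces, then try json.loads."""
    candidate = buffer
    if candidate.count('"') % 2 == 1:
        candidate += '"'  # close a dangling string
    candidate = candidate.rstrip().rstrip(",")  # drop a trailing comma
    candidate += "}" * (candidate.count("{") - candidate.count("}"))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return {}

# Simulated token chunks for {"name": "Alice", "age": 28}
chunks = ['{"name": ', '"Ali', 'ce", ', '"age": 2', '8}']

buffer, states = "", []
for chunk in chunks:
    buffer += chunk
    states.append(parse_partial(buffer))

print(states)  # states[-1] → {'name': 'Alice', 'age': 28}
```

Notice that intermediate states can hold truncated values (name='Ali', age=2), which is why partial streaming suits progressive display better than final business logic.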

Common variations

  • Use async streaming with async for loops for asynchronous applications.
  • Switch models (e.g., gpt-4o-mini vs. gpt-4o) depending on latency and cost needs; any streaming-capable model works.
  • Use Instructor with Anthropic models by wrapping their client similarly.
python
import asyncio
import os
from openai import AsyncOpenAI
from pydantic import BaseModel
import instructor

class UserInfo(BaseModel):
    name: str
    age: int

async def async_stream():
    # Async streaming requires the async client
    client = instructor.from_openai(AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]))

    messages = [{"role": "user", "content": "Extract name and age from: 'Bob is 35 years old.'"}]

    stream = client.chat.completions.create_partial(
        model="gpt-4o-mini",
        messages=messages,
        response_model=UserInfo,
    )

    async for partial_user in stream:
        print(partial_user)

asyncio.run(async_stream())
output (representative)
name='Bob' age=None
name='Bob' age=35

Troubleshooting

  • If streaming yields no output, ensure your environment variable OPENAI_API_KEY is set correctly.
  • Confirm the model you chose supports streaming (current chat models such as gpt-4o-mini and gpt-4o do); "partial" describes the response model, not the LLM.
  • If Instructor parsing fails, verify your response_model matches the expected output format.
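The last point can be checked without any API call, assuming Pydantic v2 (which Instructor depends on): validation fails loudly when a field's type doesn't match the extracted text.

```python
from pydantic import BaseModel, ValidationError

class UserInfo(BaseModel):
    name: str
    age: int

# A payload that matches the schema validates cleanly
ok = UserInfo.model_validate({"name": "Alice", "age": 28})

# A mismatched payload (age written out as prose) raises ValidationError,
# the same class of error Instructor surfaces when parsing fails
try:
    UserInfo.model_validate({"name": "Alice", "age": "twenty-eight"})
    mismatch_rejected = False
except ValidationError:
    mismatch_rejected = True

print(ok, mismatch_rejected)
```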

Key Takeaways

  • Call create_partial (or pass response_model=instructor.Partial[YourModel] with stream=True) to receive partially populated response models in real time.
  • Any streaming-capable model such as gpt-4o-mini works well for low-latency extraction with Instructor.
  • Process streamed tokens incrementally to build or display output progressively.
  • Async streaming is supported and recommended for asynchronous Python applications.
  • Ensure your response_model in Instructor matches the expected structured output.
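To make the incremental-processing takeaway concrete, here is a sketch that diffs consecutive partial states so a UI only redraws changed fields. The states list simulates what partial streaming yields; no API call is made:

```python
# Simulated sequence of partial extraction states, oldest first
states = [
    {"name": None, "age": None},
    {"name": "Alice", "age": None},
    {"name": "Alice", "age": 28},
]

previous = {}
updates = []
for state in states:
    # Report only fields whose value changed since the last partial
    changed = {k: v for k, v in state.items() if previous.get(k) != v}
    updates.append(changed)
    previous = state

print(updates)  # → [{}, {'name': 'Alice'}, {'age': 28}]
```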
Verified 2026-04 · gpt-4o-mini, gpt-4o