Threads: conversation containers
Why this matters
Building multi-turn conversations with the Assistants API requires understanding thread lifecycle. Threads decouple conversation state from the Assistant definition, allowing you to reuse the same Assistant across multiple independent conversations without manual history management: critical for production systems handling concurrent user sessions.
Explanation
What Threads Do: A Thread is an object that holds the conversation history between a user and an Assistant. Instead of passing the entire message history with each API call, you create a thread once, then append messages to it and run the Assistant on that thread. The API tracks context and previous responses for you.

How They Work: When you create a thread, OpenAI assigns it a unique ID and stores it server-side. Each message you add gets an immutable ID and timestamp. When you run the Assistant on a thread, your request carries only the thread ID and the Assistant ID; the system reconstructs full context from the stored history. Tool calls, file references, and response metadata stay attached to the thread for auditing and recovery.

When to Use: Threads are essential for any production conversation system: chatbots with session persistence, support ticket threads, multi-turn reasoning workflows, or anywhere you need conversation memory to survive process restarts or be shared across backend instances.
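The append-then-run cycle described above can be sketched as a small helper for follow-up turns. This is a minimal sketch, not a definitive implementation: the function name `ask` is hypothetical, `client` is assumed to be an `openai.OpenAI()` instance, and `create_and_poll` is assumed to be available (it is a convenience helper in recent versions of the Python SDK).

```python
def ask(client, thread_id, assistant_id, question):
    """Append one user message to an existing thread, run the Assistant,
    and return the text of its newest reply (None if the run did not complete).

    `client` is assumed to be an openai.OpenAI() instance (hypothetical helper).
    """
    # Only the new message is sent; earlier turns already live on the thread.
    client.beta.threads.messages.create(
        thread_id=thread_id, role="user", content=question
    )
    # create_and_poll blocks until the run stops progressing.
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread_id, assistant_id=assistant_id
    )
    if run.status != "completed":
        return None
    # Messages are listed newest-first by default, so the first item is the reply.
    newest = client.beta.threads.messages.list(thread_id=thread_id, limit=1)
    return newest.data[0].content[0].text.value
```

Calling `ask` repeatedly with the same `thread_id` continues one conversation; calling it with a different `thread_id` and the same `assistant_id` starts an independent one.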
Request code
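import time

from openai import OpenAI

# Reads OPENAI_API_KEY from the environment.
client = OpenAI()

# Create an empty thread; its ID is the handle for the whole conversation.
thread = client.beta.threads.create()
print(f"Thread created: {thread.id}")

# Append the user's message to the thread.
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What is the capital of France?",
)

# Define the Assistant (reusable across many threads).
assistant = client.beta.assistants.create(
    name="Geography Expert",
    model="gpt-4-1106-preview",
    instructions="You are a geography expert. Answer geography questions concisely.",
)

# Start a run, then poll while it is still queued or in progress.
# Looping on these two states avoids spinning forever if the run
# ends as failed, cancelled, or expired.
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(
        thread_id=thread.id,
        run_id=run.id,
    )

if run.status != "completed":
    print(f"Run ended with status {run.status}: {run.last_error}")

# Print the conversation oldest-first (the default order is newest-first).
messages = client.beta.threads.messages.list(thread_id=thread.id, order="asc")
for msg in messages.data:
    speaker = "Assistant" if msg.role == "assistant" else "User"
    print(f"{speaker}: {msg.content[0].text.value}")
Authentication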
Ensure the OPENAI_API_KEY environment variable is set before you create the client. The OpenAI SDK reads it at client instantiation time, not at import: `client = OpenAI()` picks up the key automatically. If you are running in a container or Lambda where the key comes from a secret store rather than the environment, pass it explicitly: `client = OpenAI(api_key='sk-...')`. Threads themselves don't require additional scopes; the API key only needs Assistants API access (default for organization keys).
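A missing key otherwise surfaces only when the client is constructed, so it can help to fail early with a clear message. The following is a minimal sketch; `load_api_key` is a hypothetical helper, not part of the SDK.

```python
import os


def load_api_key(env=os.environ):
    """Return the API key from the given environment mapping,
    or raise a clear error if it is missing or blank (hypothetical helper)."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it or inject it from your secret store"
        )
    return key


# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(api_key=load_api_key())
```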
Response shape
| Field | Type | Description |
|---|---|---|
| id | string | Unique thread identifier; use this for all subsequent thread operations |
| object | string | Always 'thread' |
| created_at | integer | Unix timestamp of thread creation |
| metadata | object | Custom key-value pairs you can attach (optional, empty dict by default) |
Field guide
id: Store this immediately; it's your handle to the entire conversation. If lost, the thread is orphaned and unrecoverable.
metadata: Attach user_id, session_id, or conversation_type here. Combined with your own record of thread IDs, this becomes invaluable for answering 'which threads belong to user X' or filtering by conversation purpose.
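One way to follow both pieces of advice is to tag the thread at creation and record its ID in your own database in the same step. This is a sketch under assumptions: the `threads` table, `SCHEMA`, and both function names are hypothetical, SQLite stands in for whatever database you use, and `client` is assumed to be an `openai.OpenAI()` instance. Metadata values are passed as strings.

```python
import sqlite3

# Hypothetical local table that maps thread IDs to users.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS threads ("
    "thread_id TEXT PRIMARY KEY, user_id TEXT NOT NULL, created_at INTEGER NOT NULL)"
)


def create_tracked_thread(client, db, user_id, session_id):
    """Create a thread tagged with metadata and record its ID locally."""
    thread = client.beta.threads.create(
        metadata={"user_id": user_id, "session_id": session_id}
    )
    db.execute(
        "INSERT INTO threads (thread_id, user_id, created_at) VALUES (?, ?, ?)",
        (thread.id, user_id, thread.created_at),
    )
    db.commit()
    return thread.id


def threads_for_user(db, user_id):
    """Answer 'which threads belong to user X' from the local table."""
    rows = db.execute(
        "SELECT thread_id FROM threads WHERE user_id = ? ORDER BY created_at",
        (user_id,),
    )
    return [row[0] for row in rows]
```

The lookup runs against the local table, so it keeps working even when you only hold a user ID and no thread ID.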
Setup trap
If you create the thread and immediately add a message, then immediately run the Assistant without waiting for the message.create() call to complete, you may get a race condition where the run executes before the message is attached. Always await or verify the message was added (response includes message ID) before calling runs.create(). In Python's synchronous SDK, this is rarely an issue, but in async contexts, it's a silent killer.
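In an async codebase, the safe ordering amounts to awaiting the message call before starting the run. A minimal sketch, assuming `client` is an `openai.AsyncOpenAI()` instance; the function name is hypothetical.

```python
async def add_message_then_run(client, thread_id, assistant_id, text):
    """Await message creation, confirm it has an ID, then start the run.

    `client` is assumed to be an openai.AsyncOpenAI() instance.
    """
    message = await client.beta.threads.messages.create(
        thread_id=thread_id, role="user", content=text
    )
    # A message ID in the response confirms the message is stored on the thread.
    if not message.id:
        raise RuntimeError("message was not created; refusing to start the run")
    run = await client.beta.threads.runs.create(
        thread_id=thread_id, assistant_id=assistant_id
    )
    return message.id, run.id


# Usage (requires the openai package):
#   import asyncio
#   from openai import AsyncOpenAI
#   asyncio.run(add_message_then_run(AsyncOpenAI(), thread_id, assistant_id, "Hello"))
```

The race appears when the two coroutines are scheduled together (for example with `asyncio.gather`) instead of awaited in sequence.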
Cost
Each run consumes tokens billed at your model's rate (gpt-4-1106-preview at ~$0.01/$0.03 per 1K input/output tokens as of April 2026). Every run re-reads the thread's history, so a 10-message thread whose context has reached 5K tokens consumes ~5K input tokens on each run, before counting the reply. Long threads accumulate context: after 10+ turns, each new run may process 10K+ input tokens just for history. Consider archiving threads after ~50 messages, or implement a rolling-window strategy.
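The way history inflates input tokens can be made concrete with a rough estimator. This is a simplified sketch: it uses the per-1K prices quoted above, assumes fixed message sizes, ignores the Assistant's instructions and any tool output, and `thread_cost` is a hypothetical name.

```python
INPUT_PRICE_PER_1K = 0.01   # USD per 1K input tokens, figure quoted above
OUTPUT_PRICE_PER_1K = 0.03  # USD per 1K output tokens, figure quoted above


def thread_cost(turns, tokens_per_user_msg, tokens_per_reply):
    """Estimate total tokens and cost when every run re-reads the full history."""
    history = 0
    total_in = 0
    total_out = 0
    for _ in range(turns):
        history += tokens_per_user_msg   # the new user message joins the history
        total_in += history              # the run reads the whole history
        total_out += tokens_per_reply
        history += tokens_per_reply      # the reply joins the history too
    cost = (
        total_in / 1000 * INPUT_PRICE_PER_1K
        + total_out / 1000 * OUTPUT_PRICE_PER_1K
    )
    return total_in, total_out, round(cost, 4)


# 10 turns of 100-token questions and 400-token replies:
# 23,500 input tokens and 4,000 output tokens, roughly $0.355 in total.
print(thread_cost(10, 100, 400))
```

Input tokens grow quadratically with the number of turns while output tokens grow linearly, which is why a rolling window pays off on long threads.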
Rate limits
Thread creation and message addition are cheap operations (no token cost), but runs are rate-limited per model. With 10 concurrent user threads, a few users sending requests in rapid succession can exhaust the run limit and trigger 429 errors. Implement queue-based run submission or batch runs by user to avoid them.
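A simple form of queue-based submission is a gate that caps how many runs are in flight at once. The sketch below is one possible design, not the only one: `RunGate` is a hypothetical class, and `start_run` stands for any callable of yours that creates a run and waits for it.

```python
import threading


class RunGate:
    """Cap how many runs are in flight at once across all user threads."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def submit(self, start_run, *args, **kwargs):
        # Blocks until a slot is free, calls start_run, then frees the slot,
        # even if start_run raises.
        with self._slots:
            return start_run(*args, **kwargs)


# Usage sketch: share one gate between all request handlers.
#   gate = RunGate(max_concurrent=3)
#   reply = gate.submit(my_run_and_wait, thread_id, assistant_id)
```

Callers beyond the cap wait for a slot instead of sending requests that would be rejected, which trades a little latency for fewer 429 responses.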
Common gotcha
Developers often forget that runs execute asynchronously and can take seconds to minutes to finish. A single status check, or a hardcoded timeout of 100ms, will fail. Always implement a polling loop with exponential backoff and a maximum wait time. Additionally, threads are NOT deleted automatically: they persist indefinitely and may count against API quotas or costs, so implement cleanup for archived conversations.
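A polling loop with exponential backoff and a wait cap might look like the sketch below. The function name, default delays, and the set of statuses that stop polling are assumptions; `client` is assumed to be an `openai.OpenAI()` instance, and `sleep` is injectable so the loop can be exercised without real waiting.

```python
import time

# Statuses at which polling should stop. requires_action is not final,
# but the caller must submit tool outputs before the run can continue.
STOP_STATUSES = {
    "completed", "failed", "cancelled", "expired", "incomplete", "requires_action",
}


def wait_for_run(client, thread_id, run_id, max_wait=120.0,
                 first_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Poll a run with exponential backoff until it stops or max_wait elapses."""
    delay = first_delay
    waited = 0.0
    while True:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status in STOP_STATUSES:
            return run
        if waited >= max_wait:
            raise TimeoutError(f"run {run_id} still {run.status} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # 1s, 2s, 4s, ... capped at max_delay
```

The caller still has to inspect `run.status` on return, since a stopped run is not necessarily a successful one.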
Error recovery
RateLimitError
NotFoundError with thread_id
InvalidRequestError 'run_id not found'
AuthenticationError
Experienced dev note
Threads are stateful by design, which means they're also orphanable: threads with no cleanup policy will bloat your API footprint invisibly. First, implement a background job that uses each thread's created_at timestamp to archive (or soft-delete) threads older than N days. Second, don't poll run status in a tight loop: use exponential backoff starting at 1 second. Third, store thread IDs in your application database linked to user_id; this makes conversation recovery trivial if your backend crashes. Finally, test thread behavior under concurrent load: multiple messages added to the same thread before a run completes can cause subtle sequencing issues.
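The cleanup job from the first point can be sketched as follows, assuming your application database already holds `(thread_id, created_at)` pairs. The function name and the record shape are hypothetical; `client` is assumed to be an `openai.OpenAI()` instance, and this version hard-deletes rather than archives.

```python
import time

SECONDS_PER_DAY = 86400


def delete_stale_threads(client, records, max_age_days, now=None):
    """Delete threads older than max_age_days.

    records: iterable of (thread_id, created_at) pairs from your own database,
    with created_at as a unix timestamp. Returns the IDs that were deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    deleted = []
    for thread_id, created_at in records:
        if created_at < cutoff:
            client.beta.threads.delete(thread_id)
            deleted.append(thread_id)
    return deleted
```

Run it from a scheduled job and remove the returned IDs from your database afterwards, so the local index and the server-side threads stay in sync.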
Check your understanding
You have a thread with 5 existing messages. You add a 6th message, then immediately call runs.create() on that thread without waiting. The run completes and returns a response. Does the Assistant's response include context from the 6th message, and why or why not?
Answer hint
The synchronous SDK blocks until the message is created before returning control, so the 6th message exists. However, if you're in an async context or the call is somehow non-blocking in your setup, the race condition matters. The safest pattern is to always check the message.id in the response before proceeding.