assistant role: previous assistant responses
Why this matters
Multi-turn conversations require the model to see what it said before: without previous assistant messages, the model has no memory of its own reasoning, making follow-ups incoherent and defeating the purpose of a stateful conversation.
Explanation
The assistant role in the OpenAI Chat Completions API represents messages that came from the model itself during earlier turns of the conversation. When you send a follow-up message to continue a conversation, you must include all previous turns, both user and assistant, so the model can see what it already said and respond coherently.
Under the hood, the API doesn't store conversation state on the server. Every request is stateless: you send the entire thread of messages (system prompt, all user messages, all previous assistant responses) and the model processes them as context to generate the next token. The assistant role is simply how you tell the API "this message came from me (the model) in a previous turn." The model then uses it as input to its transformer attention mechanism to understand conversation flow.
Use this pattern whenever you're building conversational experiences: chatbots, multi-step reasoning, follow-up questions, or clarifications. The thread grows with each turn, so be mindful of token limits on very long conversations.
Request code
from openai import OpenAI

client = OpenAI()

# Build conversation history: each turn includes previous user and assistant messages
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 5 + 3?"},
    {"role": "assistant", "content": "5 + 3 equals 8."},
    {"role": "user", "content": "Can you explain how you got that?"},
]

# Send the entire conversation thread; the model sees its previous response
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=conversation,
    temperature=0.7,
)

# Store the assistant's response for the next turn
assistant_response = response.choices[0].message.content
print(f"Assistant: {assistant_response}")

# For a third turn, append the assistant's response first, then the user's follow-up
conversation.append({"role": "assistant", "content": assistant_response})
conversation.append({"role": "user", "content": "Does that apply to negative numbers too?"})

response2 = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=conversation,
)
print(f"Third turn: {response2.choices[0].message.content}")
Authentication
Set your API key before instantiating the client. The OpenAI SDK reads OPENAI_API_KEY from the environment at instantiation time, so run export OPENAI_API_KEY='your-key-here' in your shell first, then create the client with client = OpenAI(). Alternatively, pass the key explicitly: client = OpenAI(api_key='sk-...').
Response shape
| Field | Description |
|---|---|
| choices | List of completion objects |
| choices[0].message.role | String: 'assistant' |
| choices[0].message.content | String: the model's response text |
| choices[0].finish_reason | String: 'stop' (normal completion), 'length' (max_tokens hit), or 'tool_calls' (if using function calling) |
| usage.prompt_tokens | Integer: tokens in your input messages |
| usage.completion_tokens | Integer: tokens in the response |
Field guide
choices[0].message.content: The text the model generated in response to your conversation thread. Always extract this and append it back to your conversation history.
finish_reason: If this is 'length', the response was cut off because it hit a token limit. In a long conversation this can mean the thread is approaching the model's limit, so consider summarizing older messages.
usage: Use this to track cost: (prompt_tokens / 1000 × $0.005) + (completion_tokens / 1000 × $0.015), using the per-1,000-token rates quoted here for gpt-4-turbo as of April 2026. Confirm the rates against the current pricing page before relying on them. Monitoring this reveals whether your conversation history is growing unexpectedly.
Setup trap
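The OpenAI SDK reads the API key when you instantiate OpenAI(), not when you make a request. Setting os.environ['OPENAI_API_KEY'] after creating the client object has no effect on that client. In current 1.x versions of the Python SDK this does not fail silently: if no key is available, the OpenAI() constructor raises an OpenAIError immediately, and if the variable held a stale or wrong key at instantiation, the client keeps using it and requests fail with an AuthenticationError. Set the environment variable first, then create the client.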
Cost
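Long conversations become expensive because you send the entire thread every turn. Take a 10-turn conversation with 500 tokens per user message and 200 tokens per assistant response, priced at $0.005 per 1,000 prompt tokens and $0.015 per 1,000 completion tokens. Counting each message once gives (5,000 / 1000 × $0.005) + (2,000 / 1000 × $0.015) = $0.055, but that understates the bill. Because turn k resends all earlier messages, its prompt is 500k + 200(k − 1) tokens, which sums to 36,500 prompt tokens over 10 turns (ignoring the system prompt). The actual cost is roughly (36,500 / 1000 × $0.005) + (2,000 / 1000 × $0.015) ≈ $0.21 per conversation, about four times the naive figure. Consider summarizing old messages or using a RAG pattern to inject only relevant context instead of the full thread.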
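The effect of resending the thread can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the per-1,000-token rates are the ones quoted in this section and should be treated as assumptions, and the system prompt is counted as zero tokens unless you pass a value.

```python
def cumulative_tokens(turns, user_tokens, assistant_tokens, system_tokens=0):
    """Return (prompt_tokens, completion_tokens) billed across all turns."""
    prompt_total = 0
    history = system_tokens
    for _ in range(turns):
        history += user_tokens       # the new user message joins the thread
        prompt_total += history      # the whole thread is sent as the prompt
        history += assistant_tokens  # the reply joins the thread for the next turn
    return prompt_total, turns * assistant_tokens


def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_rate_per_1k=0.005, completion_rate_per_1k=0.015):
    """Estimate cost in dollars; the default rates are assumptions, not a price list."""
    return ((prompt_tokens / 1000) * prompt_rate_per_1k
            + (completion_tokens / 1000) * completion_rate_per_1k)


prompt, completion = cumulative_tokens(turns=10, user_tokens=500, assistant_tokens=200)
print(prompt, completion)                            # 36500 2000
print(f"${estimate_cost(prompt, completion):.4f}")   # $0.2125
```

In a real application the usage field of each response gives the exact counts, so a calculation like this is mainly useful for budgeting before you build.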
Rate limits
If you're running many concurrent conversations, each client.chat.completions.create() call counts against your rate limit. A free-tier account may be limited to as few as 3 requests per minute. For production, use exponential backoff with jitter when you receive a 429 error.
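A minimal sketch of exponential backoff with jitter is shown here. It is deliberately generic: call_with_backoff and its parameters are hypothetical names, the function wraps any callable, and the exception type to retry on is passed in by the caller. The demo simulates a 429 with a stand-in exception so it runs without network access.

```python
import random
import time


def call_with_backoff(fn, retry_on, max_retries=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call fn(); when it raises retry_on, wait and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Exponential ceiling (base_delay * 2^attempt, capped at max_delay)
            # with "full jitter": sleep a random time between 0 and the ceiling.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


class SimulatedRateLimit(Exception):
    """Stand-in for a 429 response in this demo."""


attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise SimulatedRateLimit()
    return "ok"


print(call_with_backoff(flaky_request, SimulatedRateLimit, base_delay=0.01))
```

With the OpenAI Python SDK you would likely pass retry_on=openai.RateLimitError and wrap the client.chat.completions.create(...) call in a lambda. The client constructor also accepts a max_retries option, which may be enough for simple cases.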
Common gotcha
Developers often forget to include the assistant's previous response when building the next request. If you send only the new user message without appending the assistant's earlier response to the messages list, the model cannot see what it said before and loses the context of the conversation. Always append the assistant response with role='assistant' before sending the next user message.
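One way to catch this mistake early is a small check on the messages list before sending it. The helper below is a hypothetical sketch, not part of the SDK. It flags any user message that directly follows another user message, which usually means an assistant reply was never appended. It is a heuristic only: consecutive user messages are not necessarily invalid, so treat a hit as a warning rather than an error.

```python
def find_missing_assistant_turns(messages):
    """Return indexes of user messages that directly follow another user message."""
    problems = []
    previous_role = None
    for index, message in enumerate(messages):
        role = message["role"]
        if role == "user" and previous_role == "user":
            problems.append(index)
        previous_role = role
    return problems


broken = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 5 + 3?"},
    # The assistant's "5 + 3 equals 8." reply was never appended here.
    {"role": "user", "content": "Can you explain how you got that?"},
]

print(find_missing_assistant_turns(broken))  # [2]
```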
Error recovery
AuthenticationError
BadRequestError (invalid_request_error)
RateLimitError
ContextLengthExceededError
Experienced dev note
The entire conversation is sent on every request, so the token cost of each request scales linearly with conversation length, and the cumulative cost of a conversation grows roughly quadratically with the number of turns. In production, you often want to keep only the last 5-10 turns plus the system prompt, or implement a 'summary' mechanism where old turns are replaced with a 'Here's what was discussed:' paragraph. Also, always store the full conversation thread on your backend (in a database), not in client-side memory: browsers refresh, sessions end, and you need the history for audit trails anyway.
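The "last N turns plus the system prompt" approach can be sketched as follows. trim_history is a hypothetical helper, and it assumes a turn means one user message together with everything that follows it up to the next user message. It trims only the copy you send to the API; the full thread should still be stored on your backend.

```python
def trim_history(messages, max_turns=5):
    """Keep all system messages plus the most recent max_turns turns."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Each user message marks the start of a turn.
    user_indexes = [i for i, m in enumerate(rest) if m["role"] == "user"]
    if len(user_indexes) <= max_turns:
        return system + rest
    start = user_indexes[-max_turns]  # first message of the oldest turn we keep
    return system + rest[start:]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Question 1"},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": "Question 2"},
    {"role": "assistant", "content": "Answer 2"},
    {"role": "user", "content": "Question 3"},
]

for message in trim_history(history, max_turns=2):
    print(message["role"], "-", message["content"])
```

Starting the window at a user message keeps the trimmed thread from opening with an orphaned assistant reply. A token-based budget would be more precise than a turn count, but it needs a tokenizer and is left out of this sketch.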
Check your understanding
You have a 3-turn conversation. The user asked 'What is machine learning?' and the assistant explained it. Then the user asked 'Can you give me a Python example?' and the assistant provided one. Now the user sends a third message. Without writing code, describe exactly what messages array you would send to the API for that third turn, and explain why the assistant's first response must be included.
Answer hint
The messages array must be: [system prompt, user turn 1, assistant turn 1, user turn 2, assistant turn 2 (the reply containing the Python example), user turn 3]. If you omit any previous assistant response, the model loses the context of what it already said and may contradict itself or start fresh.