API Beginner easy · 5 min

assistant role: previous assistant responses

What you will learn

Include previous assistant messages in a conversation thread to maintain context and let the model see its own reasoning history.

Why this matters

Multi-turn conversations require the model to see what it said before. Without previous assistant messages, the model has no memory of its own reasoning, which makes follow-ups incoherent and defeats the purpose of a stateful conversation.

Skip if: You only need one response and have no follow-up questions; a single-turn request is enough. RAG systems that inject summaries instead of the full conversation history reconstruct context differently, and a search-over-documents system needs the retrieved documents, not assistant-role history.

Explanation

The assistant role in the OpenAI Chat Completions API represents messages that came from the model itself during earlier turns of the conversation. When you send a follow-up message to continue a conversation, you must include all previous turns, both user and assistant, so the model can see what it already said and respond coherently.

Under the hood, the API doesn't store conversation state on the server. Every request is stateless: you send the entire thread of messages (system prompt, all user messages, all previous assistant responses) and the model processes them as context to generate the next token. The assistant role is simply how you tell the API "this message came from me (the model) in a previous turn." The model then uses it as input to its transformer attention mechanism to understand conversation flow.
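To make the statelessness concrete, here is a minimal offline sketch. It makes no network call: fake_model is a hypothetical stand-in for the API that only reports how many messages it was shown, and send_turn is an illustrative helper, not part of the OpenAI SDK.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def fake_model(messages: List[Message]) -> str:
    """Hypothetical stand-in for the API: reports how much context it received."""
    return f"I can see {len(messages)} message(s)."


def send_turn(
    history: List[Message],
    user_text: str,
    model: Callable[[List[Message]], str] = fake_model,
) -> str:
    """Append the user message, send the FULL thread, then record the reply."""
    history.append({"role": "user", "content": user_text})
    reply = model(history)  # the whole thread goes out on every request
    history.append({"role": "assistant", "content": reply})
    return reply


history: List[Message] = [
    {"role": "system", "content": "You are a helpful assistant."}
]
first = send_turn(history, "What is 5 + 3?")
second = send_turn(history, "Can you explain how you got that?")
print(first)
print(second)
```

The second call sees more messages than the first only because the caller kept the history list and sent it again. Nothing is remembered on the other side.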

Use this pattern whenever you're building conversational experiences: chatbots, multi-step reasoning, follow-up questions, or clarifications. The thread grows with each turn, so be mindful of token limits on very long conversations.

Request code

python

from openai import OpenAI

client = OpenAI()

# Build conversation history: each turn includes previous user and assistant messages
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 5 + 3?"},
    {"role": "assistant", "content": "5 + 3 equals 8."},
    {"role": "user", "content": "Can you explain how you got that?"},
]

# Send the entire conversation thread; the model sees its previous response
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=conversation,
    temperature=0.7,
)

# Store the assistant's response for the next turn
assistant_response = response.choices[0].message.content
print(f"Assistant: {assistant_response}")

# For a third turn, append the assistant's response first, then the user's follow-up
conversation.append({"role": "assistant", "content": assistant_response})
conversation.append({"role": "user", "content": "Does that apply to negative numbers too?"})

response2 = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=conversation,
)

print(f"Third turn: {response2.choices[0].message.content}")

Authentication

Set your API key before instantiating the client. The OpenAI SDK reads OPENAI_API_KEY from the environment at instantiation time, so run export OPENAI_API_KEY='your-key-here' first and then create the client with client = OpenAI(). Alternatively, pass the key explicitly: client = OpenAI(api_key='sk-...').

Response shape

Field	Description
`choices`	List of completion objects
`choices[0].message.role`	String: 'assistant'
`choices[0].message.content`	String: the model's response text
`choices[0].finish_reason`	String: 'stop' (normal completion), 'length' (max_tokens hit), or 'tool_calls' (if using function calling)
`usage.prompt_tokens`	Integer: tokens in your input messages
`usage.completion_tokens`	Integer: tokens in the response

Field guide

choices[0].message.content

The text the model generated in response to your conversation thread. Always extract this to append back to your conversation history.

finish_reason

If this is 'length', the response was cut off: your conversation may be hitting token limits and you should consider summarizing older messages.

usage

Use this to track cost: (prompt_tokens / 1000 × $0.005) + (completion_tokens / 1000 × $0.015), using the per-1,000-token rates this guide assumes for gpt-4-turbo as of April 2026. Check the pricing page for current figures. Monitoring this reveals if your conversation history is growing unexpectedly.

Setup trap

The OpenAI SDK reads the API key when you instantiate OpenAI(), not when you make a request. If you set os.environ['OPENAI_API_KEY'] after creating the client object, the client never picks it up: in the 1.x SDK the constructor raises an OpenAIError when it finds no key, and a client that was created with a different key keeps using that one. Set the environment variable first, then create the client.

Cost

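Long conversations become expensive because you send the entire thread every turn. Take a 10-turn conversation with 500 tokens per user message and 200 tokens per assistant response. Ignoring the system prompt, the prompt on turn k contains 500k + 200(k - 1) tokens, so the ten requests total 36,500 prompt tokens and 2,000 completion tokens. At $0.005 and $0.015 per 1,000 tokens, that is (36.5 × $0.005) + (2 × $0.015) ≈ $0.21 per conversation, and the prompt total grows roughly with the square of the number of turns. Consider summarizing old messages or using a RAG pattern to inject only relevant context instead of the full thread.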
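Because the thread is resent on every request, prompt tokens accumulate turn by turn. The sketch below counts them under simplifying assumptions: fixed message sizes, no system prompt, and the per-1,000-token rates quoted in this guide. conversation_cost is an illustrative helper, not an SDK function.

```python
from typing import Tuple


def conversation_cost(
    turns: int,
    user_tokens: int,
    assistant_tokens: int,
    prompt_rate_per_1k: float,
    completion_rate_per_1k: float,
) -> Tuple[int, int, float]:
    """Return (total_prompt_tokens, total_completion_tokens, cost_in_dollars)."""
    total_prompt = 0
    total_completion = 0
    history_tokens = 0  # tokens already in the thread before this turn
    for _ in range(turns):
        history_tokens += user_tokens         # new user message joins the thread
        total_prompt += history_tokens        # the whole thread is billed as input
        total_completion += assistant_tokens  # the reply is billed as output
        history_tokens += assistant_tokens    # the reply joins the thread too
    cost = (
        total_prompt / 1000 * prompt_rate_per_1k
        + total_completion / 1000 * completion_rate_per_1k
    )
    return total_prompt, total_completion, cost


prompt_total, completion_total, dollars = conversation_cost(
    turns=10,
    user_tokens=500,
    assistant_tokens=200,
    prompt_rate_per_1k=0.005,
    completion_rate_per_1k=0.015,
)
print(prompt_total, completion_total, round(dollars, 4))
```

For real conversations, summing usage.prompt_tokens and usage.completion_tokens across responses gives the same totals without any estimation.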

Rate limits

If you're running many concurrent conversations, each client.chat.completions.create() call counts against your rate limit. A free tier account may be limited to as few as 3 requests per minute. For production, use exponential backoff with jitter when you receive a 429 error.

Common gotcha

Developers forget to include the assistant's previous response when building the next request. If you send only the new user message without appending the assistant's earlier response to the messages list, the model has no memory of what it said before and will lose context. Always append the assistant response with role='assistant' before sending the next user message.

Error recovery

AuthenticationError

Your API key is invalid, missing, or has been revoked. Verify OPENAI_API_KEY is set and check your key on platform.openai.com.

BadRequestError (invalid_request_error)

Usually means a message in your thread has an invalid role. For plain conversations the role must be 'system', 'user', or 'assistant' ('tool' is also accepted for tool results). Check that assistant responses use role='assistant', not role='model'.
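A pre-flight check can catch a bad role before the request is sent. This is a minimal sketch: find_role_problems is a hypothetical helper, and ALLOWED_ROLES is an assumed set that may be incomplete for your model or feature set.

```python
from typing import Dict, List

# Assumed role set for this sketch; the API may accept others
# depending on the model and the features in use.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def find_role_problems(messages: List[Dict[str, str]]) -> List[str]:
    """Return one description per message whose role is not in ALLOWED_ROLES."""
    problems = []
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"messages[{index}]: unexpected role {role!r}")
    return problems


thread = [
    {"role": "user", "content": "What is 5 + 3?"},
    {"role": "model", "content": "5 + 3 equals 8."},  # wrong: should be "assistant"
]
problems = find_role_problems(thread)
print(problems)
```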

RateLimitError

You hit the API rate limit. Implement exponential backoff: wait 0.5, 1, then 2 seconds between retries, add random jitter to each wait, and retry. For production, queue requests through a job processor.
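The backoff schedule described above can be sketched as a small generic wrapper. retry_with_backoff and FakeRateLimit are hypothetical names used for illustration; the demo injects a fake sleep function so it runs instantly and offline.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on a retryable error wait 0.5s, 1s, 2s, ... plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            delay += random.uniform(0, delay * 0.25)  # jitter spreads out retries
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


class FakeRateLimit(Exception):
    """Stand-in for a 429 error so the demo needs no network access."""


attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimit()
    return "ok"


delays: List[float] = []
result = retry_with_backoff(flaky, retry_on=(FakeRateLimit,), sleep=delays.append)
print(result, len(delays))
```

With the 1.x SDK you would presumably pass retry_on=(openai.RateLimitError,) and wrap the client.chat.completions.create() call in a lambda. The 25% jitter factor is an arbitrary choice for this sketch.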

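BadRequestError (context_length_exceeded)

Your messages list exceeds the model's context window (128k tokens for gpt-4-turbo). The 1.x SDK reports this as a BadRequestError with the error code context_length_exceeded rather than as a dedicated exception class. Summarize or truncate older messages, keeping the system prompt and the most recent user message.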

Experienced dev note

The entire conversation is sent on every request, so the token cost of each request grows linearly with conversation length, and the cumulative cost of a conversation grows roughly quadratically. In production, you often want to keep only the last 5-10 turns plus the system prompt, or implement a 'summary' mechanism where old turns are replaced with a 'Here's what was discussed:' paragraph. Also, always store the full conversation thread on your backend (database), not in client-side memory: browsers refresh, sessions end, and you need the history for audit trails anyway.
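A sliding window over the thread might look like the sketch below. trim_history is a hypothetical helper: it keeps every system message plus the most recent non-system messages, and it drops a leading assistant message so the window starts on a user turn. It does not summarize the dropped turns.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], keep_last_messages: int = 10) -> List[Message]:
    """Return a new list: system messages plus the most recent messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-keep_last_messages:] if keep_last_messages > 0 else []
    # Start the window on a user message so no assistant reply is orphaned.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


thread: List[Message] = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(1, 7):
    thread.append({"role": "user", "content": f"question {i}"})
    thread.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(thread, keep_last_messages=4)
print([m["content"] for m in trimmed])
```

The full thread list is left untouched, so it can still be persisted to the backend while only the trimmed copy is sent to the API.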

Check your understanding

You are on the third turn of a conversation. In turn 1 the user asked 'What is machine learning?' and the assistant explained it. In turn 2 the user asked 'Can you give me a Python example?' and the assistant provided one. The user now sends a third message. Without writing code, describe exactly what messages array you would send to the API for that third turn, and explain why the assistant's first response must be included.

Show answer hint

The messages array must be: [system prompt, user turn 1, assistant turn 1, user turn 2, assistant turn 2 (the Python example), user turn 3]. If you omit any previous assistant response, the model loses the context of what it already said and may contradict itself or start fresh.

VERSION The assistant role syntax has been consistent in the openai 1.x SDK since its release. No breaking changes between 1.0 and 1.40+ (April 2026). Older openai 0.28.x used openai.ChatCompletion.create() with identical role strings, so the messages list carries over unchanged when you migrate.
