API Intermediate medium · 6 min

Long conversation history

What you will learn

Maintain multi-turn conversations with Gemini by storing and replaying message history across API calls.

Why this matters

Real-world chatbots and assistants need to remember context. The Gemini API is stateless: it doesn't persist conversation state for you. You must manage history manually, which affects token costs, response latency, and how you architect your application.

Skip if: your workload is single-turn analysis, one-off summaries, or purely request/response with no conversational turns, since there is no history to replay. You can also skip this if a managed chat service already handles history for you.

Explanation

The Gemini API treats each generate_content() call as independent. To build a conversation, you explicitly pass a history list of previous messages to the model. This list includes your prior user inputs and the model's prior responses, letting Gemini understand context and maintain coherence across turns.

Under the hood, each message in the history is encoded into tokens and sent with your new prompt. The model processes the entire conversation thread, not just the latest message, to generate a response. This means token usage scales with conversation length, and you bear the cost of replaying history on every request. The API does not cache history server-side; you manage it client-side in your application code.

Use this pattern when building chatbots, multi-turn Q&A systems, or debugging assistants where users expect the model to remember previous exchanges. Store history in memory for short sessions, or in a database for long-lived conversations. Trim old messages or implement summarization once history grows beyond roughly 10,000 tokens to keep latency and costs reasonable.

Request code

python

import google.generativeai as genai
import os

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

conversation_history = []

user_message_1 = "What are the three laws of thermodynamics?"
conversation_history.append({"role": "user", "parts": [user_message_1]})

response_1 = model.generate_content(conversation_history)
assistant_response_1 = response_1.text
conversation_history.append({"role": "model", "parts": [assistant_response_1]})

print(f"User: {user_message_1}")
print(f"Assistant: {assistant_response_1}\n")

user_message_2 = "Can you explain the second law in simpler terms?"
conversation_history.append({"role": "user", "parts": [user_message_2]})

response_2 = model.generate_content(conversation_history)
assistant_response_2 = response_2.text
conversation_history.append({"role": "model", "parts": [assistant_response_2]})

print(f"User: {user_message_2}")
print(f"Assistant: {assistant_response_2}")

Authentication

Set your Google API key in the environment before instantiating the client:

export GOOGLE_API_KEY="your-api-key-here"

In Python, the library can read this variable automatically; the request code above passes it explicitly through genai.configure(). No manual token refresh is required for REST-based chat.

Response shape

Field	Description
`text`	The complete text response from the model
`finish_reason`	Enum indicating how generation ended (e.g., 'STOP', 'MAX_TOKENS')
`usage_metadata`	Object with token counts for the request, including `prompt_token_count`

Field guide

text

The actual conversation response. Extract this and append to your history as a 'model' role message.

finish_reason

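Tells you whether the model finished naturally ('STOP') or hit a limit ('MAX_TOKENS'). In the Python SDK this is reported per candidate (for example, response.candidates[0].finish_reason). If you see MAX_TOKENS, the reply was cut off at the output limit: raise max_output_tokens or ask for a shorter answer, and reduce history depth if the prompt is approaching the context window.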

usage_metadata.prompt_token_count

Critical: this includes all history tokens, not just your latest turn. Monitor this to understand cost growth as conversations lengthen.
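
To make that growth concrete, here is a small arithmetic sketch. It assumes every message (user or model) is the same fixed size, which real conversations will not match, and the function names are illustrative rather than part of the SDK.

```python
def prompt_tokens_at_turn(turn: int, tokens_per_message: int) -> int:
    """Prompt tokens sent on the given 1-based turn.

    Before turn n the history holds n-1 user messages and n-1 model
    replies; the request adds one new user message on top of that.
    """
    if turn < 1:
        raise ValueError("turn must be >= 1")
    prior_messages = 2 * (turn - 1)
    return (prior_messages + 1) * tokens_per_message


def cumulative_prompt_tokens(turns: int, tokens_per_message: int) -> int:
    """Total prompt tokens sent across turns 1..turns."""
    return sum(
        prompt_tokens_at_turn(t, tokens_per_message) for t in range(1, turns + 1)
    )


if __name__ == "__main__":
    # Assumed size: 100 tokens per message (about 200 per user/model turn).
    for turn in (1, 5, 10):
        print(
            turn,
            prompt_tokens_at_turn(turn, 100),
            cumulative_prompt_tokens(turn, 100),
        )
```

Under this fixed-size assumption the per-request prompt grows linearly with the turn number, while the running total grows with the square of the number of turns.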

Setup trap

When storing history in a database for persistence, you must serialize the message structure correctly. The history list expects dictionaries with 'role' (string: 'user' or 'model') and 'parts' (list of strings or content objects). If you store only raw text without role/parts metadata, reconstruction fails silently and the model sees malformed input.
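
One way to avoid this trap is to serialize the whole message structure and check it again when loading. The sketch below assumes text-only history (every part is a string) and uses JSON; history_to_json and history_from_json are hypothetical helper names, not SDK functions.

```python
import json


def history_to_json(history: list[dict]) -> str:
    """Serialize history, keeping role and parts for every message."""
    return json.dumps(
        [{"role": m["role"], "parts": list(m["parts"])} for m in history]
    )


def history_from_json(payload: str) -> list[dict]:
    """Rebuild history and fail loudly if a stored row lost its structure."""
    history = json.loads(payload)
    if not isinstance(history, list):
        raise ValueError("stored history is not a list")
    for i, message in enumerate(history):
        if not isinstance(message, dict) or "role" not in message:
            raise ValueError(f"message {i} is missing its role: {message!r}")
        if not isinstance(message.get("parts"), list):
            raise ValueError(f"message {i} has no 'parts' list")
    return history
```

Raising on load turns the silent failure described above into an error you see before the request is sent.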

Cost

Per-request token cost scales linearly with history length, so the cumulative cost of a conversation grows faster than linearly. In a 10-turn conversation where each turn is ~200 tokens, the 10th request alone sends about 2,000 prompt tokens, most of it replayed history. For long-running applications, implement a sliding window (keep only the last 20 turns) or summarize old exchanges into a system prompt to control costs. Each generate_content() call charges for prompt tokens (including all history), not just new input.
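
A sliding window can be sketched in a few lines. This version counts a turn as one user message plus one model reply, and it assumes the replayed history should begin on a user message; trim_history is an illustrative name, not an SDK function.

```python
def trim_history(history: list[dict], max_turns: int = 20) -> list[dict]:
    """Return a new list holding at most the last max_turns exchanges.

    A turn is one user message plus the model reply, so at most
    2 * max_turns messages are kept. The original list is not modified.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be >= 1")
    trimmed = history[-2 * max_turns:]
    # If the cut landed on a model reply, drop it so we start on a user turn.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

You would pass trim_history(conversation_history) to generate_content() while still keeping the full list in storage, so older context remains available for summarization later.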

Rate limits

Free-tier rate limits depend on the model and change over time; the figure cited here is ~60 requests per minute, so check the current limits for the model you use. Frequent multi-turn conversations with long history can hit the limit quickly if you're running concurrent sessions. For production, implement request queuing or upgrade to a paid tier. Also, resubmitting the same long history repeatedly in quick succession wastes quota: cache responses when possible.

Common gotcha

Developers often assume the model remembers context without sending history on the second turn. If you build a list but only send the latest message to generate_content(), the model has no context and performs poorly. Always pass the full conversation_history list on every call.
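
A small wrapper can make it hard to forget the history. In this sketch, generate is a hypothetical callable standing in for something like lambda history: model.generate_content(history).text, which keeps the example runnable without network access.

```python
from typing import Callable


class Conversation:
    """Keeps history and always sends the full list to `generate`."""

    def __init__(self, generate: Callable[[list[dict]], str]):
        self._generate = generate
        self.history: list[dict] = []

    def send(self, user_text: str) -> str:
        self.history.append({"role": "user", "parts": [user_text]})
        try:
            # The whole history goes out on every call, not just user_text.
            reply = self._generate(self.history)
        except Exception:
            # Roll back so a failed call does not leave a dangling user turn.
            self.history.pop()
            raise
        self.history.append({"role": "model", "parts": [reply]})
        return reply
```

The SDK's own model.start_chat() session does similar bookkeeping; a hand-rolled class like this is mainly useful when you need control over storage or trimming.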

Error recovery

InvalidArgument (history format)

Error message like 'invalid value for key: role'. Verify each history item is a dict with 'role' (string: 'user' or 'model') and 'parts' (list). Do not nest parts further; each entry in parts should be a string or a Part object, not another list.
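
A pre-flight check can catch these format problems before the request is sent. The sketch below covers only the rules described in this lesson (it is not the SDK's own validation), and find_history_problems is a hypothetical helper name.

```python
def find_history_problems(history) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    if not isinstance(history, list):
        return ["history must be a list"]
    problems = []
    for i, message in enumerate(history):
        if not isinstance(message, dict):
            problems.append(f"[{i}] expected dict, got {type(message).__name__}")
            continue
        role = message.get("role")
        if role not in ("user", "model"):
            problems.append(f"[{i}] role must be 'user' or 'model', got {role!r}")
        parts = message.get("parts")
        if not isinstance(parts, list) or not parts:
            problems.append(f"[{i}] parts must be a non-empty list")
        elif any(isinstance(p, (list, tuple)) for p in parts):
            problems.append(f"[{i}] parts must not contain nested lists")
    return problems
```

Logging the returned list usually points straight at the offending message index, for example a stored row that used 'assistant' instead of 'model'.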

ResourceExhausted

You have hit the rate limit or quota. Wait before retrying (exponential backoff: 1s, 2s, 4s, 8s). For production, use a queue and batch requests.
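
The backoff schedule above can be sketched as a generic retry helper. It takes the exception types to retry on as a parameter so it runs without the SDK installed; with the SDK you would likely pass google.api_core.exceptions.ResourceExhausted, which is an assumption to confirm against the errors you actually observe. call_with_backoff is an illustrative name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exceptions with doubling delays."""
    attempt = 0
    while True:
        try:
            return call()
        except retry_on:
            if attempt >= max_retries:
                raise  # Out of retries: surface the original error.
            # Delays are base_delay * 1, 2, 4, 8, ... (1s, 2s, 4s, 8s by default).
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

A call site might look like call_with_backoff(lambda: model.generate_content(history), retry_on=(ResourceExhausted,)). Adding random jitter to each delay is a common refinement when many sessions retry at once.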

DeadlineExceeded

The API call timed out, usually because the history is extremely long (>50,000 tokens) or the model is overloaded. Trim history or split into separate conversations.

Experienced dev note

Most developers store history as a simple list of strings and realize too late that they've lost role information or created ambiguity about who said what. Structure history from the start as a list of dicts with explicit roles. Also, token usage is easy to overlook: call model.count_tokens() on the history before every generate_content() call in production to forecast costs and abort early if history exceeds your budget. This prevents surprise billing.
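
A budget guard along those lines might look like the sketch below. The counter is injected so the example runs offline; with the SDK it could be lambda h: model.count_tokens(h).total_tokens. HistoryBudgetExceeded and check_token_budget are hypothetical names.

```python
from typing import Callable


class HistoryBudgetExceeded(Exception):
    """Raised when the history would cost more prompt tokens than allowed."""


def check_token_budget(
    history: list[dict],
    count_tokens: Callable[[list[dict]], int],
    max_prompt_tokens: int,
) -> int:
    """Return the token count, or raise before the generation call is made."""
    total = count_tokens(history)
    if total > max_prompt_tokens:
        raise HistoryBudgetExceeded(
            f"history is {total} tokens, budget is {max_prompt_tokens}"
        )
    return total
```

Catching HistoryBudgetExceeded is a natural place to trim or summarize the history and then try again.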

Check your understanding

If your user sends a 15-turn conversation and you call generate_content() with the full history, but the model responds with an answer that ignores context from turn 3, what is most likely wrong: the history format, the model's reasoning, or something about how you built the list?

Answer hint

Check the role field and parts structure. The model processes history correctly only if each message has a valid role ('user' or 'model') and parts is a list of strings. A common mistake is appending raw text instead of wrapping it in the dict structure, or swapping role names (e.g., using 'assistant' instead of 'model').

VERSION: google-generativeai 0.8.x uses the 'role' field with values 'user' and 'model'. Code written against the library's older PaLM-style chat interface used an 'author' field instead, so update all role labels when porting. Pass 'parts' as a list, as in the examples here, rather than a raw string as the message value.
