Long conversation history
Why this matters
Real-world chatbots and assistants need to remember context. The Gemini API is stateless: it doesn't persist conversation state for you. You must manage history manually, which affects token costs, response latency, and how you architect your application.
Explanation
The Gemini API treats each generate_content() call as independent. To build a conversation, you explicitly pass a history list of previous messages to the model. This list includes your prior user inputs and the model's prior responses, letting Gemini understand context and maintain coherence across turns.
Under the hood, each message in the history is encoded into tokens and sent with your new prompt. The model processes the entire conversation thread, not just the latest message, to generate a response. This means token usage scales with conversation length, and you bear the cost of replaying history on every request. The API does not store conversation history for you server-side; you manage it client-side in your application code.
Use this pattern when building chatbots, multi-turn Q&A systems, or debugging assistants where users expect the model to remember previous exchanges. Store history in memory for short sessions, or in a database for long-lived conversations. Trim old messages or summarize them once history grows beyond roughly 10,000 tokens to keep latency and costs reasonable.
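The trimming step can be sketched as follows. This is a minimal illustration, not an SDK feature: trim_history and estimate_tokens are hypothetical names, and the four-characters-per-token estimate is an assumption you would replace with a real token count in practice.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]  # {"role": "user" | "model", "parts": [str, ...]}


def estimate_tokens(message: Message) -> int:
    """Hypothetical estimator: roughly one token per four characters of text."""
    text = "".join(str(part) for part in message["parts"])
    return max(1, len(text) // 4)


def trim_history(
    history: List[Message],
    max_tokens: int,
    count: Callable[[Message], int] = estimate_tokens,
) -> List[Message]:
    """Return the most recent messages whose combined size fits max_tokens."""
    kept: List[Message] = []
    total = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(history):
        cost = count(message)
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()
    # Drop leading model replies so the trimmed history starts with a user turn.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

The function returns a new list and leaves the stored history untouched, so you can keep the full transcript in your database and send only the trimmed view to the model.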
Request code
import os

import google.generativeai as genai

# Configure the SDK with the API key from the environment.
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

# The history lives in your application, not on the server.
conversation_history = []

# Turn 1: append the user message, send the history, record the reply.
user_message_1 = "What are the three laws of thermodynamics?"
conversation_history.append({"role": "user", "parts": [user_message_1]})
response_1 = model.generate_content(conversation_history)
assistant_response_1 = response_1.text
conversation_history.append({"role": "model", "parts": [assistant_response_1]})
print(f"User: {user_message_1}")
print(f"Assistant: {assistant_response_1}\n")

# Turn 2: the full history (turn 1 plus the new message) is sent again.
user_message_2 = "Can you explain the second law in simpler terms?"
conversation_history.append({"role": "user", "parts": [user_message_2]})
response_2 = model.generate_content(conversation_history)
assistant_response_2 = response_2.text
conversation_history.append({"role": "model", "parts": [assistant_response_2]})
print(f"User: {user_message_2}")
print(f"Assistant: {assistant_response_2}")
Authentication
Set your Google API key as an environment variable before running the code:
export GOOGLE_API_KEY="your-api-key-here"
In Python, the request code above reads it from os.environ and passes it to genai.configure(). No manual token refresh is required for REST-based chat.
Response shape
| Field | Description |
|---|---|
| text | The complete text response from the model |
| finish_reason | Enum indicating how generation ended (e.g., 'STOP', 'MAX_TOKENS') |
| usage_metadata | Object holding token counts for the request, including prompt_token_count |
Field guide
text: The actual conversation response. Extract this and append it to your history as a 'model' role message.
finish_reason: Tells you whether the model finished naturally ('STOP') or hit a limit ('MAX_TOKENS'). MAX_TOKENS means the reply was cut off at the output token limit, so raise max_output_tokens or ask for a shorter answer; reducing history depth helps only when the whole conversation is close to the model's context window.
usage_metadata.prompt_token_count: Critical because it includes all history tokens, not just your latest turn. Monitor it to understand cost growth as conversations lengthen.
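Reading these fields might look like the sketch below. It assumes finish_reason is read from the first entry of response.candidates, which is how the google.generativeai response object appears to expose it; summarize_response is a hypothetical helper, and fake_response is a stand-in so the example runs without an API call.

```python
from types import SimpleNamespace


def summarize_response(response) -> dict:
    """Collect the fields discussed above into a plain dict for logging."""
    finish = response.candidates[0].finish_reason
    # Enum members expose .name; fall back to str() for plain values.
    finish_name = getattr(finish, "name", str(finish))
    return {
        "text": response.text,
        "finish_reason": finish_name,
        "prompt_tokens": response.usage_metadata.prompt_token_count,
        "truncated": finish_name == "MAX_TOKENS",
    }


# Stand-in object for illustration only; a real call would pass the value
# returned by model.generate_content(conversation_history).
fake_response = SimpleNamespace(
    text="Entropy tends to increase.",
    candidates=[SimpleNamespace(finish_reason="STOP")],
    usage_metadata=SimpleNamespace(prompt_token_count=412),
)
summary = summarize_response(fake_response)
print(summary["finish_reason"], summary["prompt_tokens"])
```

Logging prompt_tokens on every turn gives you a simple record of how quickly the replayed history is growing.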
Setup trap
When storing history in a database for persistence, you must serialize the message structure correctly. The history list expects dictionaries with 'role' (string: 'user' or 'model') and 'parts' (list of strings or content objects). If you store only raw text without role/parts metadata, you can no longer tell who said what when you rebuild the list, and the model receives either malformed input or one undifferentiated block of text, often without any obvious error.
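One way to avoid this trap is to store the whole message structure as JSON and validate it on the way back in. The sketch below is an assumption-laden illustration: dump_history and load_history are hypothetical helpers, and it handles only text parts (strings), not richer content objects.

```python
import json
from typing import Dict, List

VALID_ROLES = {"user", "model"}


def dump_history(history: List[Dict]) -> str:
    """Serialize history to JSON, keeping role and parts for every message."""
    return json.dumps(history)


def load_history(raw: str) -> List[Dict]:
    """Parse stored JSON and fail loudly if any message is malformed."""
    history = json.loads(raw)
    if not isinstance(history, list):
        raise ValueError("history must be a list of messages")
    for index, message in enumerate(history):
        if not isinstance(message, dict):
            raise ValueError(f"message {index} is not a dict")
        role = message.get("role")
        if role not in VALID_ROLES:
            raise ValueError(f"message {index} has invalid role {role!r}")
        parts = message.get("parts")
        if not isinstance(parts, list) or not parts:
            raise ValueError(f"message {index} needs a non-empty 'parts' list")
    return history
```

Raising on load turns a silent reconstruction problem into an explicit error you can catch before the request is sent.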
Cost
Prompt size grows linearly with history length, so the total cost of a conversation grows faster than linearly. In a 10-turn conversation where each turn is ~200 tokens, the 10th request replays roughly 2,000 tokens of history, and the ten requests together bill about 11,000 prompt tokens. For long-running applications, implement a sliding window (for example, keep only the last 20 turns) or summarize old exchanges into a system prompt to control costs. Each generate_content() call charges for prompt tokens (including all history), not just new input.
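The arithmetic can be checked with a few lines of Python. This is a simplified model, assuming every turn adds the same number of tokens and request n replays turns 1 through n; prompt_tokens_per_request is a hypothetical helper, and the 4-turn window is chosen only to keep the numbers small.

```python
from typing import List, Optional


def prompt_tokens_per_request(
    turns: int, tokens_per_turn: int, window: Optional[int] = None
) -> List[int]:
    """Prompt size for each request, assuming every turn adds tokens_per_turn.

    Request n replays turns 1..n, or only the last `window` turns if set.
    """
    sizes = []
    for n in range(1, turns + 1):
        replayed = n if window is None else min(n, window)
        sizes.append(replayed * tokens_per_turn)
    return sizes


full = prompt_tokens_per_request(turns=10, tokens_per_turn=200)
windowed = prompt_tokens_per_request(turns=10, tokens_per_turn=200, window=4)

print(full[-1])       # 2000 prompt tokens on the 10th request
print(sum(full))      # 11000 prompt tokens across all 10 requests
print(sum(windowed))  # 6800 with a 4-turn sliding window
```

Under these assumptions the window caps the per-request prompt at 800 tokens, which is why the savings grow the longer the conversation runs.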
Rate limits
The ~60 requests per minute often quoted for the Gemini API free tier is approximate; actual limits vary by model and tier and change over time, so check the current quota for your project. Frequent multi-turn conversations with long history can hit the limit quickly if you're running concurrent sessions. For production, implement request queuing or upgrade to a paid tier. Also, resubmitting the same long history repeatedly in quick succession wastes quota, so cache responses when possible.
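A common companion to request queuing is retrying with exponential backoff when a rate-limit error comes back. The sketch below is generic and hedged: call_with_backoff is a hypothetical helper, and the exception types to retry on are passed in by the caller rather than assumed.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # 1s, 2s, 4s, ... plus jitter so concurrent sessions
            # do not all retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

With the SDK you would likely wrap the request as call_with_backoff(lambda: model.generate_content(conversation_history), retry_on=(ResourceExhausted, DeadlineExceeded)), assuming those errors are raised as the classes of the same names in google.api_core.exceptions.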
Common gotcha
Developers often assume the model remembers context without sending history on the second turn. If you build a list but only send the latest message to generate_content(), the model has no context and performs poorly. Always pass the full conversation_history list on every call.
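Wrapping each turn in a small helper makes it hard to forget the history. This is a sketch under stated assumptions: send_turn is a hypothetical function, and fake_generate stands in for the model so the example runs offline. With the real SDK you would pass something like lambda h: model.generate_content(h).text instead.

```python
from typing import Callable, Dict, List

History = List[Dict[str, object]]


def send_turn(history: History, user_text: str, generate: Callable[[History], str]) -> str:
    """Append the user message, send the FULL history, then record the reply."""
    history.append({"role": "user", "parts": [user_text]})
    try:
        reply = generate(history)  # the whole list, never just user_text
    except Exception:
        history.pop()  # keep the history consistent if the call fails
        raise
    history.append({"role": "model", "parts": [reply]})
    return reply


# Stand-in for the model: it reports how many messages it was given,
# which makes the growing context visible.
def fake_generate(history: History) -> str:
    return f"I can see {len(history)} message(s)."


history: History = []
print(send_turn(history, "What are the three laws of thermodynamics?", fake_generate))
print(send_turn(history, "Can you explain the second law in simpler terms?", fake_generate))
```

The second call reports three messages (two user turns and one model reply), confirming that the earlier exchange travels with the new question.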
Error recovery
InvalidArgument (history format)
ResourceExhausted
DeadlineExceeded
Experienced dev note
Most developers store history as a simple list of strings and realize too late that they've lost role information or created ambiguity about who said what. Structure history from the start as a list of dicts with explicit roles. Token usage is also invisible until you measure it, so call model.count_tokens() on the history before each generate_content() call in production to forecast costs and abort early if history exceeds your budget. This prevents surprise billing.
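The abort-early check can be sketched as a small guard. ensure_within_budget and HistoryBudgetExceeded are hypothetical names, and the counter is injected so the example stays independent of the SDK. With the real client, the counter would presumably be lambda h: model.count_tokens(h).total_tokens.

```python
from typing import Callable, Dict, List


class HistoryBudgetExceeded(Exception):
    """Raised when the prompt would cost more tokens than allowed."""


def ensure_within_budget(
    history: List[Dict],
    count_tokens: Callable[[List[Dict]], int],
    budget: int,
) -> int:
    """Return the token count, or raise before any billable call is made."""
    total = count_tokens(history)
    if total > budget:
        raise HistoryBudgetExceeded(
            f"history needs {total} tokens, budget is {budget}"
        )
    return total
```

Catching HistoryBudgetExceeded is a natural place to trim or summarize the history and then try again.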
Check your understanding
If your user sends a 15-turn conversation and you call generate_content() with the full history, but the model responds with an answer that ignores context from turn 3, what is most likely wrong: the history format, the model's reasoning, or something about how you built the list?
Answer hint
Check the role field and parts structure. The model processes history correctly only if each message has a valid role ('user' or 'model') and parts is a list of strings. A common mistake is appending raw text instead of wrapping it in the dict structure, or swapping role names (e.g., using 'assistant' instead of 'model').