How-to · Intermediate · 3 min read

How AI agents remember across conversations

Quick answer
AI agents remember across conversations by resending the conversation history in the messages array with every request, and by using embedding-based vector stores or external databases to persist context beyond the model's token limit. This lets an agent recall past interactions and hold a coherent multi-turn dialogue.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the version specifier so the shell does not treat > as a redirect)
  • pip install faiss-cpu (optional for vector storage)

Setup

Install the openai Python SDK and, optionally, faiss-cpu for vector storage so the agent can persist memory across conversations.

  • Set your OpenAI API key as an environment variable OPENAI_API_KEY.
bash
pip install openai faiss-cpu
output
Collecting openai
Collecting faiss-cpu
Successfully installed openai-1.x.x faiss-cpu-1.x.x

Step by step

This example shows how to maintain conversation memory by appending previous messages and optionally using embeddings with FAISS to recall long-term context.

python
import os
from openai import OpenAI
import faiss
import numpy as np

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Conversation history stored in memory
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]

# Function to add user and assistant messages

def add_message(role, content):
    conversation_history.append({"role": role, "content": content})

# Function to get chat completion with full history

def chat_with_memory(user_input):
    add_message("user", user_input)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation_history
    )
    assistant_reply = response.choices[0].message.content
    add_message("assistant", assistant_reply)
    return assistant_reply

# Example usage
print("User: Hello, who won the World Cup in 2018?")
reply1 = chat_with_memory("Hello, who won the World Cup in 2018?")
print("Assistant:", reply1)

print("User: And who was the top scorer?")
reply2 = chat_with_memory("And who was the top scorer?")
print("Assistant:", reply2)

# Optional: Using embeddings to store and retrieve long-term memory

# Create embeddings for conversation turns
embedding_model = "text-embedding-3-small"

# Initialize FAISS index
embedding_dim = 1536  # text-embedding-3-small returns 1536-dimensional vectors by default
index = faiss.IndexFlatL2(embedding_dim)

# Store texts and embeddings
texts = []
embeddings = []

# Add conversation turns to vector store
for msg in conversation_history:
    if msg["role"] != "system":
        texts.append(msg["content"])
        emb_response = client.embeddings.create(model=embedding_model, input=msg["content"])
        emb_vector = np.array(emb_response.data[0].embedding, dtype=np.float32)
        embeddings.append(emb_vector)

if embeddings:
    index.add(np.vstack(embeddings))

# Function to retrieve relevant past messages

def retrieve_similar(query, k=2):
    query_emb = client.embeddings.create(model=embedding_model, input=query).data[0].embedding
    query_vector = np.array(query_emb, dtype=np.float32).reshape(1, -1)
    distances, indices = index.search(query_vector, k)
    # FAISS pads results with -1 when the index holds fewer than k vectors
    return [texts[i] for i in indices[0] if i != -1]

# Retrieve relevant context
query = "Who was the top scorer in the 2018 World Cup?"
relevant_context = retrieve_similar(query)
print("Relevant past messages:", relevant_context)
output
User: Hello, who won the World Cup in 2018?
Assistant: France won the 2018 FIFA World Cup.
User: And who was the top scorer?
Assistant: The top scorer of the 2018 World Cup was Harry Kane with 6 goals.
Relevant past messages: ['France won the 2018 FIFA World Cup.', 'The top scorer of the 2018 World Cup was Harry Kane with 6 goals.']
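
Retrieved snippets are only useful once they are injected back into the prompt. One common pattern, sketched below with a hypothetical build_messages_with_context helper, is to prepend the retrieved memories to the system message before calling the model:

```python
def build_messages_with_context(retrieved, user_input):
    """Build a fresh messages list that carries retrieved long-term memory.

    retrieved: list of past-message strings returned by the vector store.
    user_input: the new user turn.
    """
    context_block = "\n".join(f"- {snippet}" for snippet in retrieved)
    system_prompt = (
        "You are a helpful assistant. "
        "Relevant facts from earlier conversations:\n" + context_block
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

# Pass the result straight to client.chat.completions.create(...)
messages = build_messages_with_context(
    ["France won the 2018 FIFA World Cup."],
    "Who was the captain of the winning team?",
)
print(messages[0]["content"])
```

This keeps each request small: only the handful of retrieved snippets travel with the prompt, not the entire history.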

Common variations

You can implement memory in different ways:

  • Async calls: Use the AsyncOpenAI client with async/await for non-blocking requests while managing history the same way.
  • Streaming: Stream responses token-by-token, then append the completed reply to the conversation history.
  • Different models: gpt-4o-mini is a lower-cost option; the same pattern works with claude-3-5-sonnet-20241022 via the Anthropic SDK.
  • External databases: Store conversation history or embeddings in a persistent vector database such as Pinecone or Chroma for scalable memory.
python
import asyncio
import os
from openai import AsyncOpenAI

async def async_chat():
    # Use the async client; the sync OpenAI client cannot be awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Hello"}]
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    print(response.choices[0].message.content)

asyncio.run(async_chat())
output
Hello! How can I assist you today?
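
Streaming keeps the same memory pattern: accumulate the text deltas as they arrive, then append the completed reply to the history. A minimal sketch follows; the stream here is simulated with a plain list, whereas with the real SDK you would pass stream=True to client.chat.completions.create and read each chunk.choices[0].delta.content:

```python
conversation_history = [{"role": "system", "content": "You are a helpful assistant."}]

def consume_stream(deltas, history):
    """Print streamed text deltas as they arrive and store the full reply."""
    parts = []
    for delta in deltas:
        if delta:  # the SDK may emit None/empty deltas, e.g. the final chunk
            print(delta, end="", flush=True)
            parts.append(delta)
    reply = "".join(parts)
    history.append({"role": "assistant", "content": reply})
    return reply

# Simulated deltas standing in for chunk.choices[0].delta.content values
fake_deltas = ["France ", "won ", "in 2018."]
reply = consume_stream(fake_deltas, conversation_history)
```

Because the full reply lands in conversation_history either way, the next request carries the same context whether or not it was streamed.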

Troubleshooting

  • If the conversation history grows past the model's context window, requests will fail or you will have to drop earlier messages; trim the history yourself, or move long-term memory into embeddings or an external vector store.
  • If you get RateLimitError, reduce request frequency or upgrade your API plan.
  • Ensure your environment variable OPENAI_API_KEY is set correctly to avoid authentication errors.
  • For embedding dimension mismatches in FAISS, verify the embedding model and index dimension align.
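
For the first issue, a simple mitigation is a sliding window that keeps the system message plus only the most recent turns. A minimal sketch; the max_turns cutoff is an arbitrary assumption, and production code would typically count tokens instead of messages:

```python
def trim_history(history, max_turns=6):
    """Keep the system message(s) plus the last max_turns other messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-max_turns:]

# Build a long dummy history: 1 system message + 10 user/assistant pairs
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(trimmed))  # 7: the system message plus the last 6 turns
```

Pass the trimmed list as messages on each request; anything dropped from the window can still be recovered through the vector store shown earlier.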

Key Takeaways

  • Maintain conversation history in the messages array to provide context for multi-turn dialogue.
  • Use embedding-based vector stores like FAISS to persist and retrieve long-term memory beyond token limits.
  • Async and streaming calls enable efficient memory handling in real-time applications.
  • External databases scale memory for agents handling many users or long conversations.
  • Always manage API keys securely via environment variables to avoid authentication issues.
Verified 2026-04 · gpt-4o, gpt-4o-mini, text-embedding-3-small, claude-3-5-sonnet-20241022