Intermediate · 4 min read

How to add memory to LlamaIndex chat

Quick answer
To add memory to a LlamaIndex chat, attach a ChatMemoryBuffer to a chat engine built from your vector index. The engine combines the buffered conversation history with retrieved document context on every turn, and persisting the index (and, optionally, the chat store) to disk lets context survive across sessions.

PREREQUISITES

  • Python 3.8+
  • An OpenAI API key
  • pip install llama-index (add faiss-cpu or chromadb only if you use those vector store variations)

Setup

Install the required packages and set your OpenAI API key as an environment variable.

bash
pip install llama-index
export OPENAI_API_KEY="sk-..."

Step by step

This example adds memory to a LlamaIndex chat by attaching a ChatMemoryBuffer to a context chat engine. On each turn, the engine retrieves relevant document chunks from the vector index and combines them with the buffered conversation history.

python
import os

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

# The OpenAI client reads OPENAI_API_KEY from the environment;
# set it before running: export OPENAI_API_KEY="sk-..."
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first")

# Initialize the LLM (GPT-4o, deterministic output)
llm = OpenAI(model="gpt-4o", temperature=0)

# Load documents from a directory
documents = SimpleDirectoryReader("./data").load_data()

# Build a vector store index over the documents
index = VectorStoreIndex.from_documents(documents)

# Persist the index to disk so it can be reloaded in later sessions (optional)
index.storage_context.persist(persist_dir="./storage")

# Chat memory: a rolling buffer of past messages, trimmed to a token limit
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

# Context chat engine: on each turn it retrieves relevant chunks from the
# index and combines them with the conversation history held in `memory`
chat_engine = index.as_chat_engine(
    chat_mode="context",
    llm=llm,
    memory=memory,
    system_prompt="You are a helpful assistant.",
)

# Example usage
if __name__ == "__main__":
    user_input = "Explain the main points from the documents."
    answer = chat_engine.chat(user_input)
    print("Assistant:", answer)
output
Assistant: The main points from the documents are ...

Common variations

  • Use a dedicated vector store such as FAISS or Chroma (llama-index-vector-stores-faiss / llama-index-vector-stores-chroma) instead of the default in-memory store.
  • Use the async methods (achat instead of chat) if your environment supports asynchronous calls.
  • Switch to other OpenAI models such as gpt-4o-mini by changing model in OpenAI; non-OpenAI models like gemini-1.5-pro require their own LlamaIndex LLM integration package.

Troubleshooting

  • If you get API key missing errors, ensure OPENAI_API_KEY is set in your environment.
  • If retrieval returns empty results, verify your documents are loaded and indexed correctly.
  • For slow responses, check your network and consider a smaller model (e.g. gpt-4o-mini) or a lower max_tokens setting.
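For the first case, a small guard at the top of your script turns a missing key into an immediate, readable message rather than a stack trace from inside the client. check_openai_key is a hypothetical helper name, not a LlamaIndex API:

```python
import os

def check_openai_key() -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("OPENAI_API_KEY"))

if __name__ == "__main__":
    if not check_openai_key():
        print("OPENAI_API_KEY is not set; export it before running the chat script")
```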

Key Takeaways

  • Attach a ChatMemoryBuffer to a chat engine to give LlamaIndex chat conversational memory.
  • A context chat engine combines retrieved document chunks with buffered history for context-aware responses.
  • Persist your index (and, optionally, the chat store) to disk to maintain memory across sessions.
  • Adjust model and vector store choices based on your latency and accuracy needs.
  • Always set your API keys securely via environment variables.
Verified 2026-04 · gpt-4o, gpt-4o-mini, gemini-1.5-pro