Intermediate · 4 min read

How to add memory to LlamaIndex chat

Quick answer
To add memory to a LlamaIndex chat, attach a ChatMemoryBuffer to a chat engine built from your vector index. The engine combines the buffered conversation history with retrieved document context on every turn, and persisting the index (and, optionally, the chat store) to disk lets context survive across sessions.

PREREQUISITES

  • Python 3.8+
  • An OpenAI API key
  • pip install llama-index (add faiss-cpu or chromadb only if you use those vector store variations)

Setup

Install the required packages and set your OpenAI API key as an environment variable.

bash
pip install llama-index
export OPENAI_API_KEY="sk-..."

Step by step

This example adds memory to a LlamaIndex chat by attaching a ChatMemoryBuffer to a context chat engine. On each turn, the engine retrieves relevant document chunks from the vector index and combines them with the buffered conversation history.

python
import os

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

# The OpenAI client reads OPENAI_API_KEY from the environment;
# set it before running: export OPENAI_API_KEY="sk-..."
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first")

# Initialize the LLM (GPT-4o, deterministic output)
llm = OpenAI(model="gpt-4o", temperature=0)

# Load documents from a directory
documents = SimpleDirectoryReader("./data").load_data()

# Build a vector store index over the documents
index = VectorStoreIndex.from_documents(documents)

# Persist the index to disk so it can be reloaded in later sessions (optional)
index.storage_context.persist(persist_dir="./storage")

# Chat memory: a rolling buffer of past messages, trimmed to a token limit
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

# Context chat engine: on each turn it retrieves relevant chunks from the
# index and combines them with the conversation history held in `memory`
chat_engine = index.as_chat_engine(
    chat_mode="context",
    llm=llm,
    memory=memory,
    system_prompt="You are a helpful assistant.",
)

# Example usage
if __name__ == "__main__":
    user_input = "Explain the main points from the documents."
    answer = chat_engine.chat(user_input)
    print("Assistant:", answer)
output
Assistant: The main points from the documents are ...

Common variations

  • Use a dedicated vector store such as FAISS or Chroma (llama-index-vector-stores-faiss / llama-index-vector-stores-chroma) instead of the default in-memory store.
  • Use the async methods (achat instead of chat) if your environment supports asynchronous calls.
  • Switch to other OpenAI models such as gpt-4o-mini by changing model in OpenAI; non-OpenAI models like gemini-1.5-pro require their own LlamaIndex LLM integration package.

Troubleshooting

  • If you get API key missing errors, ensure OPENAI_API_KEY is set in your environment.
  • If retrieval returns empty results, verify your documents are loaded and indexed correctly.
  • For slow responses, check your network and consider a smaller model (e.g. gpt-4o-mini) or a lower max_tokens setting.
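For the first case, a small guard at the top of your script turns a missing key into an immediate, readable message rather than a stack trace from inside the client. check_openai_key is a hypothetical helper name, not a LlamaIndex API:

```python
import os

def check_openai_key() -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("OPENAI_API_KEY"))

if __name__ == "__main__":
    if not check_openai_key():
        print("OPENAI_API_KEY is not set; export it before running the chat script")
```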

Key Takeaways

  • Attach a ChatMemoryBuffer to a chat engine to give LlamaIndex chat conversational memory.
  • A context chat engine combines retrieved document chunks with buffered history for context-aware responses.
  • Persist your index (and, optionally, the chat store) to disk to maintain memory across sessions.
  • Adjust model and vector store choices based on your latency and accuracy needs.
  • Always set your API keys securely via environment variables.
Verified 2026-04 · gpt-4o, gpt-4o-mini, gemini-1.5-pro