Code intermediate · 4 min read

How to build a RAG chatbot with OpenAI and Chroma

Direct answer
Embed your documents into a Chroma vector store, retrieve the chunks most relevant to each query, and feed them as context to OpenAI's gpt-4o model to generate grounded answers.

Setup

Install
bash
pip install openai langchain langchain-openai langchain-community chromadb anthropic
Env vars
OPENAI_API_KEY (plus ANTHROPIC_API_KEY for the Claude 3.5 Sonnet variant)
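For example, in your shell (key values are placeholders):
bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."  # only needed for the Claude variant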
Imports
python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

Examples

In:  What is the capital of France?
Out: The capital of France is Paris.
In:  Explain the benefits of RAG in chatbots.
Out: RAG chatbots combine retrieval of relevant documents with generation, improving accuracy and grounding responses in real data.
In:  Who wrote 'Pride and Prejudice'?
Out: 'Pride and Prejudice' was written by Jane Austen.

Integration steps

  1. Load and embed your documents into Chroma vector store using OpenAI embeddings.
  2. Initialize the ChatOpenAI model; it reads your API key from the OPENAI_API_KEY environment variable.
  3. Create a retrieval-based QA chain combining the Chroma retriever and the gpt-4o model.
  4. Send user queries to the chain, which retrieves relevant docs and generates answers.
  5. Print or return the generated chatbot response.

Full code

python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Load documents (example: local text file)
loader = TextLoader("./docs/sample.txt")
docs = loader.load()

# Initialize embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings, collection_name="my_docs")

# Set up the LLM (ChatOpenAI reads OPENAI_API_KEY from the environment)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Create retrieval QA chain
retriever = vectorstore.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Query the chatbot
query = "What is the capital of France?"
answer = qa_chain.invoke({"query": query})["result"]

print(f"Q: {query}\nA: {answer}")
output
Q: What is the capital of France?
A: The capital of France is Paris.

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "<retrieved docs + user query>"}]}
Response
json
{"choices": [{"message": {"content": "<generated answer>"}}], "usage": {"total_tokens": 150}}
Extract: response.choices[0].message.content
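For reference, a minimal sketch of the same call and extraction using the raw openai SDK; the context_and_query string here is just a stand-in for the "<retrieved docs + user query>" placeholder in the request above:
python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stand-in for the retrieved documents concatenated with the user query
context_and_query = "<retrieved docs + user query>"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": context_and_query}],
)
print(response.choices[0].message.content)  # the extract path shown above
print(response.usage.total_tokens)          # maps to usage.total_tokens in the trace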

Variants

Streaming RAG Chatbot

Use streaming to improve user experience with partial answers during long responses.

python
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

loader = TextLoader("./docs/sample.txt")
docs = loader.load()
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings, collection_name="my_docs")

# streaming=True plus a stdout callback prints tokens as they are generated
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
)
retriever = vectorstore.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

query = "Explain RAG chatbots."
qa_chain.invoke({"query": query})  # partial answer streams to stdout
Async RAG Chatbot

Use async for concurrent queries or integrating into async web frameworks.

python
import asyncio

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

async def main():
    loader = TextLoader("./docs/sample.txt")
    docs = loader.load()
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(docs, embeddings, collection_name="my_docs")

    # ChatOpenAI reads OPENAI_API_KEY from the environment
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    retriever = vectorstore.as_retriever()
    qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

    query = "Who wrote 'Pride and Prejudice'?"
    result = await qa_chain.ainvoke({"query": query})
    print(f"Q: {query}\nA: {result['result']}")

asyncio.run(main())
Use Claude 3.5 Sonnet for Coding-Focused RAG

Swap the generation model to Claude 3.5 Sonnet for coding-focused answers; retrieval still uses OpenAI embeddings and Chroma.

python
import os
import anthropic
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

loader = TextLoader("./docs/sample.txt")
docs = loader.load()
embeddings = OpenAIEmbeddings()  # retrieval still uses OpenAI embeddings
vectorstore = Chroma.from_documents(docs, embeddings, collection_name="my_docs")

retriever = vectorstore.as_retriever()

query = "Explain how RAG improves coding assistants."

# Compose prompt with retrieved docs
retrieved_docs = retriever.invoke(query)
context = "\n".join(doc.page_content for doc in retrieved_docs)
prompt = f"Use the following documents to answer the question.\n{context}\nQuestion: {query}\nAnswer:"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)

Performance

Latency: ~800ms for gpt-4o non-streaming calls with small document sets
Cost: ~$0.002 per 500 tokens for gpt-4o generation, plus embedding costs
Rate limits: Tier 1: 500 RPM / 30K TPM for OpenAI GPT-4o
  • Limit document chunk size to reduce tokens sent to the LLM (see the sketch after this list).
  • Cache embeddings to avoid recomputing for unchanged documents.
  • Use lower temperature for deterministic answers and fewer tokens.
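A sketch of the chunking tip above, using LangChain's RecursiveCharacterTextSplitter; the 500-character chunk size and 50-character overlap are illustrative, not tuned values:
python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = TextLoader("./docs/sample.txt").load()

# Split into small chunks so each retrieved chunk adds few tokens to the prompt
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings(), collection_name="my_docs")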
Approach                      Latency   Cost/call   Best for
Standard RAG with GPT-4o      ~800ms    ~$0.002     General purpose accurate answers
Streaming RAG with GPT-4o     ~900ms    ~$0.002     Better UX for long answers
Async RAG with GPT-4o         ~800ms    ~$0.002     Concurrent queries in async apps
RAG with Claude 3.5 Sonnet    ~700ms    ~$0.0025    Coding and complex reasoning tasks

Quick tip

Always embed and index your documents before querying to ensure relevant context is retrieved for the LLM.
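One way to do this once and reuse the index across runs (which also covers the embedding-cache tip above) is Chroma's persist_directory option; a sketch where ./chroma_index is a hypothetical local path:
python
import os
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma

PERSIST_DIR = "./chroma_index"  # hypothetical local path
embeddings = OpenAIEmbeddings()

if os.path.isdir(PERSIST_DIR):
    # Reuse the saved index; unchanged documents are not re-embedded
    vectorstore = Chroma(persist_directory=PERSIST_DIR, embedding_function=embeddings)
else:
    # First run: embed and index the documents, persisting to disk
    docs = TextLoader("./docs/sample.txt").load()
    vectorstore = Chroma.from_documents(docs, embeddings, persist_directory=PERSIST_DIR)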

Common mistake

Not feeding retrieved documents as context to the LLM, resulting in generic or incorrect answers.
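A minimal contrast, reusing the llm, retriever, and query objects from the Full code section above:
python
# Wrong: the model answers from its own weights; retrieval is never used
answer = llm.invoke(query).content

# Right: stuff the retrieved chunks into the prompt before generating
context = "\n".join(doc.page_content for doc in retriever.invoke(query))
answer = llm.invoke(
    f"Answer using this context:\n{context}\n\nQuestion: {query}"
).content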

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022