How to build a RAG pipeline with LangChain
Direct answer
Use LangChain's document loaders, OpenAI embeddings, and FAISS vectorstore to build a RAG pipeline that retrieves relevant documents and generates answers with
ChatOpenAI.
Setup
Install
pip install langchain langchain_openai langchain_community faiss-cpu
Env vars
OPENAI_API_KEY
Imports
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain_core.prompts import ChatPromptTemplate
Examples
In: What is LangChain?
Out: LangChain is a framework for building applications with LLMs through composable components like document loaders, vectorstores, and chains.
In: Explain RAG pipeline in LangChain.
Out: A RAG pipeline in LangChain loads documents, creates embeddings, stores them in a vectorstore, retrieves relevant docs for a query, and generates answers using an LLM.
In: How to handle empty query?
Out: The pipeline returns a default message or empty response if no relevant documents are found for the query.
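For the empty-query case, the chain does not add a guard for you; check the query and the retrieval results yourself. A minimal sketch, assuming the qa_chain and vectorstore built in the full code below; the fallback messages are placeholders:
def answer_query(query: str) -> str:
    # Guard against blank input before spending tokens
    if not query.strip():
        return "Please enter a question."
    # Guard against a query that matches no stored documents
    docs = vectorstore.as_retriever().invoke(query)
    if not docs:
        return "No relevant documents were found for this query."
    return qa_chain.run(query)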
Integration steps
- Install required packages and set OPENAI_API_KEY in environment variables.
- Load your documents using a LangChain document loader like TextLoader.
- Create embeddings for documents with OpenAIEmbeddings.
- Index embeddings in a FAISS vectorstore for efficient similarity search.
- Initialize a ChatOpenAI model for answer generation.
- Build a RetrievalQA chain combining the retriever and LLM to answer queries.
Full code
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA
# Load documents from local text files
loader = TextLoader("./docs/sample.txt")
docs = loader.load()
# Create embeddings for documents
embeddings = OpenAIEmbeddings()
# Build FAISS vectorstore from documents
vectorstore = FAISS.from_documents(docs, embeddings)
# Initialize ChatOpenAI model
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Create RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
# Query example
query = "What is LangChain?"
answer = qa_chain.run(query)
print(f"Query: {query}\nAnswer: {answer}") output
Query: What is LangChain?
Answer: LangChain is a framework that helps developers build applications with large language models by combining document loading, vector search, and LLMs for generation.
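If you also want the retrieved chunks back (for citations or debugging), the chain accepts return_source_documents=True and can be called with invoke instead of run. A sketch reusing the llm and vectorstore from the full code above:
# Return the source chunks alongside the generated answer
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "What is LangChain?"})
print(result["result"])                  # generated answer
for doc in result["source_documents"]:   # chunks used as context
    print(doc.metadata.get("source"))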
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "<retrieved context>\nQuestion: What is LangChain?"}]} Response
{"choices": [{"message": {"content": "LangChain is a framework that helps developers build applications with large language models..."}}], "usage": {"total_tokens": 150}} Extract
response.choices[0].message.content
Variants
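Custom prompt RAG ›
Pass your own prompt so the model answers strictly from the retrieved context. A sketch that uses the ChatPromptTemplate import from Setup; the prompt wording here is an example, not a library default.
from langchain.chains import RetrievalQA
from langchain_core.prompts import ChatPromptTemplate
# The "stuff" chain behind RetrievalQA expects {context} and {question} variables
prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
print(qa_chain.run("What is LangChain?"))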
Streaming RAG Pipeline ›
Use streaming to provide real-time token-by-token output for better user experience on long answers.
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain_core.callbacks import StreamingStdOutCallbackHandler
loader = TextLoader("./docs/sample.txt")
docs = loader.load()
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
llm = ChatOpenAI(model="gpt-4o", streaming=True, temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
query = "Explain RAG pipeline."
for token in qa_chain.stream(query):
print(token, end='') Async RAG Pipeline ›
Use async to handle multiple concurrent queries efficiently in web servers or async apps.
import os
import asyncio
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA
async def main():
loader = TextLoader("./docs/sample.txt")
docs = loader.load()
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
query = "What is LangChain?"
answer = await qa_chain.arun(query)
print(f"Query: {query}\nAnswer: {answer}")
asyncio.run(main()) Use Claude 3.5 Sonnet for RAG ›
Use Claude 3.5 Sonnet if you prefer an alternative LLM provider or want its answer quality for coding questions.
# Requires: pip install langchain_anthropic, and ANTHROPIC_API_KEY in env vars
import os
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA
loader = TextLoader("./docs/sample.txt")
docs = loader.load()
# Embeddings still come from OpenAI; Anthropic does not provide an embeddings API
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
# ChatAnthropic is a LangChain chat model, so it plugs into RetrievalQA directly;
# a hand-rolled wrapper around the anthropic SDK would fail the chain's model validation
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0, max_tokens=1024)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
query = "Explain RAG pipeline."
answer = qa_chain.run(query)
print(f"Query: {query}\nAnswer: {answer}")
Performance
Latency: ~800ms for gpt-4o non-streaming, ~400ms for embeddings + retrieval
Cost: ~$0.002 per 500 tokens for gpt-4o; embeddings cost extra per 1,000 tokens
Rate limits: Tier 1: 500 RPM / 30K TPM for the OpenAI API
- Limit document chunk size to reduce embedding tokens (see the splitter sketch after the comparison table below).
- Cache embeddings for static documents.
- Use lower temperature for deterministic answers.
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard RAG with gpt-4o | ~800ms | ~$0.002 | Balanced accuracy and cost |
| Streaming RAG | Starts immediately, ~800ms total | ~$0.002 | Better UX for long answers |
| Async RAG | ~800ms per call, concurrent | ~$0.002 | High concurrency environments |
| Claude 3.5 Sonnet RAG | ~900ms | ~$0.0025 | Best coding and reasoning accuracy |
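To act on the chunk-size tip above, split documents before embedding them. A sketch using RecursiveCharacterTextSplitter (installed as a dependency of langchain); the chunk sizes shown are illustrative, not tuned values:
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)  # smaller chunks mean fewer context tokens per query
vectorstore = FAISS.from_documents(chunks, embeddings)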
Quick tip
Pre-embed your documents and use FAISS vectorstore to speed up retrieval in your RAG pipeline.
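A sketch of that pre-embedding step using FAISS's local persistence; the index path is arbitrary:
# One-time: embed the documents and persist the index to disk
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
vectorstore.save_local("faiss_index")
# At query time: load the prebuilt index instead of re-embedding
# (the flag is required in recent versions because the index file is pickle-backed)
vectorstore = FAISS.load_local("faiss_index", OpenAIEmbeddings(),
                               allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever()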
Common mistake
Not setting the retriever properly in the RetrievalQA chain, causing the LLM to ignore document context.
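One way to catch this early is to build the retriever explicitly and confirm it returns documents before wiring it into the chain. A sketch, with k=4 as an arbitrary choice:
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top-4 chunks per query
assert retriever.invoke("What is LangChain?"), "retriever returned no documents"
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)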