Code intermediate · 3 min read

How to build a RAG pipeline with LangChain

Direct answer
Use LangChain's document loaders, OpenAI embeddings, and FAISS vectorstore to build a RAG pipeline that retrieves relevant documents and generates answers with ChatOpenAI.

Setup

Install
bash
pip install langchain langchain-openai langchain-community faiss-cpu
Env vars
OPENAI_API_KEY
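Before running any of the code below, it helps to fail fast when the key is missing. A minimal check (check_env is an illustrative helper, not part of LangChain):

```python
import os

def check_env(var="OPENAI_API_KEY"):
    """Fail fast with a clear error when a required API key is missing."""
    value = os.environ.get(var)
    if not value:
        raise RuntimeError(f"{var} is not set; export it before running the pipeline")
    return value
```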
Imports
python
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain_core.prompts import ChatPromptTemplate

Examples

In: What is LangChain?
Out: LangChain is a framework for building applications with LLMs through composable components like document loaders, vectorstores, and chains.

In: Explain RAG pipeline in LangChain.
Out: A RAG pipeline in LangChain loads documents, creates embeddings, stores them in a vectorstore, retrieves relevant docs for a query, and generates answers using an LLM.

In: How to handle empty query?
Out: The pipeline returns a default message or empty response if no relevant documents are found for the query.
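The empty-query behavior above can be made explicit with a small guard before the chain is called (guard_query and its fallback text are illustrative, not LangChain API):

```python
def guard_query(query, retrieved_docs,
                fallback="No relevant documents were found for this query."):
    """Return the fallback for blank queries or empty retrieval results,
    or None when it is safe to hand the query to the LLM."""
    if not query or not query.strip():
        return fallback
    if not retrieved_docs:
        return fallback
    return None
```

Call this with the query and the retriever's results; only proceed to the chain when it returns None.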

Integration steps

  1. Install required packages and set OPENAI_API_KEY in environment variables.
  2. Load your documents using a LangChain document loader like TextLoader.
  3. Create embeddings for documents with OpenAIEmbeddings.
  4. Index embeddings in a FAISS vectorstore for efficient similarity search.
  5. Initialize a ChatOpenAI model for answer generation.
  6. Build a RetrievalQA chain combining the retriever and LLM to answer queries.

Full code

python
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA

# Load documents from local text files
loader = TextLoader("./docs/sample.txt")
docs = loader.load()

# Create embeddings for documents
embeddings = OpenAIEmbeddings()

# Build FAISS vectorstore from documents
vectorstore = FAISS.from_documents(docs, embeddings)

# Initialize ChatOpenAI model
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Create RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# Query example
query = "What is LangChain?"
answer = qa_chain.invoke({"query": query})["result"]

print(f"Query: {query}\nAnswer: {answer}")
output
Query: What is LangChain?
Answer: LangChain is a framework that helps developers build applications with large language models by combining document loading, vector search, and LLMs for generation.

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "<retrieved context>\nQuestion: What is LangChain?"}]}
Response
json
{"choices": [{"message": {"content": "LangChain is a framework that helps developers build applications with large language models..."}}], "usage": {"total_tokens": 150}}
Extract: response.choices[0].message.content
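Applied to the trace above, the extract path looks like this (payload abbreviated from the sample response):

```python
import json

raw = '{"choices": [{"message": {"content": "LangChain is a framework..."}}], "usage": {"total_tokens": 150}}'
response = json.loads(raw)

# Walk choices[0].message.content to pull out the generated answer
answer = response["choices"][0]["message"]["content"]
total_tokens = response["usage"]["total_tokens"]
print(answer)        # LangChain is a framework...
print(total_tokens)  # 150
```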

Variants

Streaming RAG Pipeline

Use streaming to provide real-time token-by-token output for better user experience on long answers.

python
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain_core.callbacks import StreamingStdOutCallbackHandler

loader = TextLoader("./docs/sample.txt")
docs = loader.load()
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)

# RetrievalQA itself does not yield tokens, so stream them to stdout
# through a callback handler attached to the LLM
llm = ChatOpenAI(model="gpt-4o", streaming=True, temperature=0,
                 callbacks=[StreamingStdOutCallbackHandler()])
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

query = "Explain RAG pipeline."
qa_chain.invoke({"query": query})  # tokens print as they arrive
Async RAG Pipeline

Use async to handle multiple concurrent queries efficiently in web servers or async apps.

python
import os
import asyncio
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA

async def main():
    loader = TextLoader("./docs/sample.txt")
    docs = loader.load()
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(docs, embeddings)

    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

    query = "What is LangChain?"
    answer = (await qa_chain.ainvoke({"query": query}))["result"]
    print(f"Query: {query}\nAnswer: {answer}")

asyncio.run(main())
Use Claude 3.5 Sonnet for RAG

Use Claude 3.5 Sonnet when you want stronger coding accuracy or simply prefer an alternative LLM.

python
import os
from langchain_anthropic import ChatAnthropic  # pip install langchain-anthropic
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA

loader = TextLoader("./docs/sample.txt")
docs = loader.load()

# Embeddings still come from OpenAI; Claude only handles generation
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)

# RetrievalQA expects a LangChain chat model, so use ChatAnthropic rather
# than a hand-rolled wrapper around the raw Anthropic SDK; the API key is
# read from ANTHROPIC_API_KEY in the environment
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", max_tokens=1024)

qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

query = "Explain RAG pipeline."
answer = qa_chain.invoke({"query": query})["result"]
print(f"Query: {query}\nAnswer: {answer}")

Performance

Latency: ~800ms for gpt-4o non-streaming, ~400ms for embeddings + retrieval
Cost: ~$0.002 per 500 tokens for gpt-4o; embeddings cost extra per 1,000 tokens
Rate limits: Tier 1: 500 RPM / 30K TPM for OpenAI API
  • Limit document chunk size to reduce embedding tokens.
  • Cache embeddings for static documents.
  • Use lower temperature for deterministic answers.
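The chunk-size tip above can be sketched as a plain fixed-size splitter with overlap (a stand-in for LangChain's text splitters such as RecursiveCharacterTextSplitter, which a production pipeline would normally use):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks so each embedding call stays small."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from either side.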
Approach | Latency | Cost/call | Best for
--- | --- | --- | ---
Standard RAG with gpt-4o | ~800ms | ~$0.002 | Balanced accuracy and cost
Streaming RAG | Starts immediately, ~800ms total | ~$0.002 | Better UX for long answers
Async RAG | ~800ms per call, concurrent | ~$0.002 | High concurrency environments
Claude 3.5 Sonnet RAG | ~900ms | ~$0.0025 | Best coding and reasoning accuracy

Quick tip

Pre-embed your documents and use FAISS vectorstore to speed up retrieval in your RAG pipeline.
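One way to implement the pre-embedding advice is to cache embeddings by content hash, so unchanged documents are embedded only once across runs (EmbeddingCache is an illustrative sketch, not a LangChain class; embed_fn stands in for something like OpenAIEmbeddings.embed_query):

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by SHA-256 of the text; re-embed only new or changed docs."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache = {}

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]
```

For persistence across processes, the same idea works with the dict swapped for a file or key-value store.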

Common mistake

Not setting the retriever properly in the RetrievalQA chain, causing the LLM to ignore document context.
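One way to catch this is to build the chain with return_source_documents=True (a real RetrievalQA option) and assert that results actually carry retrieved context; the checker below is an illustrative helper, not LangChain API:

```python
def assert_context_used(result):
    """Validate a RetrievalQA result dict produced with return_source_documents=True."""
    sources = result.get("source_documents") or []
    if not sources:
        raise ValueError("No source documents in result; check the retriever wiring")
    return len(sources)
```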

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022