Llama RAG pipeline Python example
Direct answer
Build a retrieval-augmented generation (RAG) pipeline in Python by pairing a FAISS vector store for retrieval with a Llama model served through an OpenAI-compatible API (Groq or Together AI), calling the model via client.chat.completions.create or LangChain's ChatOpenAI.
Setup
Install
pip install openai faiss-cpu langchain langchain-community langchain-openai
Env vars
OPENAI_API_KEY, GROQ_API_KEY
Imports
import os
from openai import OpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import RetrievalQA
from langchain_core.documents import Document
Examples
Query: 'What is retrieval-augmented generation?'
Answer: 'Retrieval-augmented generation (RAG) combines vector search with LLMs to provide accurate, context-aware answers by retrieving relevant documents and generating responses.'
Query: 'Explain Llama model usage in RAG pipelines.'
Answer: 'Llama models can be used as the generative LLM in RAG pipelines by integrating with vector stores like FAISS to retrieve context and generate precise answers.'
Query: 'How to handle empty search results in RAG?'
Answer: 'Implement fallback logic to handle empty retrievals, such as default responses or querying the LLM without context.'
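A minimal sketch of that fallback, assuming the retriever and llama_chat objects set up as in the full code below:
docs = retriever.invoke(query)
if docs:
    context = "\n\n".join(doc.page_content for doc in docs)
    answer = llama_chat.invoke(f"{context}\nQuestion: {query}").content
else:
    # Nothing relevant retrieved: fall back to the bare model or a canned reply
    answer = llama_chat.invoke(query).content
print(answer)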
Integration steps
- Point ChatOpenAI at Groq's OpenAI-compatible endpoint with the Llama model name and the API key from os.environ
- Load and embed documents into a FAISS vector store for retrieval
- Create a retrieval-based QA chain combining the vector store retriever and Llama chat model
- Invoke the chain with a user query to retrieve relevant documents and generate an answer
- Print the generated answer to the console
Full code
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.chains import RetrievalQA
# Point ChatOpenAI at Groq's OpenAI-compatible endpoint for the Llama model
llama_chat = ChatOpenAI(
    model="llama-3.3-70b-versatile",
    temperature=0.0,
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)
# Load documents from a folder of local text files
loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
raw_docs = loader.load()
# Embed documents and build the FAISS vector store; from_documents embeds
# the texts with the given Embeddings object and indexes them in one step
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.environ["OPENAI_API_KEY"])
vector_store = FAISS.from_documents(raw_docs, embeddings)
# Set up a retriever that returns the top 3 most similar chunks
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
# Create the RetrievalQA chain combining the retriever and Llama chat model
qa_chain = RetrievalQA.from_chain_type(llm=llama_chat, retriever=retriever)
# Query example
query = "What is retrieval-augmented generation?"
result = qa_chain.invoke({"query": query})
print(f"Query: {query}")
print(f"Answer: {result['result']}")
API trace
Request
{"model": "llama-3.3-70b-versatile", "messages": [{"role": "user", "content": "<retrieved context>\nQuestion: What is retrieval-augmented generation?"}]} Response
{"choices": [{"message": {"content": "Retrieval-augmented generation (RAG) combines vector search with large language models to provide accurate, context-aware answers by retrieving relevant documents and generating responses."}}]} Extract
response.choices[0].message.content
Variants
Streaming response version
Use streaming for better user experience with long answers or interactive applications.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation."}],
    stream=True,
)
# Streamed chunks carry incremental tokens in choices[0].delta, not .message
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()
Async version with LangChain
Use async when integrating into asynchronous web servers or concurrent workflows.
import os
import asyncio
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.chains import RetrievalQA
async def main():
    loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
    raw_docs = loader.load()
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.environ["OPENAI_API_KEY"])
    vector_store = FAISS.from_documents(raw_docs, embeddings)
    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
    llama_chat = ChatOpenAI(
        model="llama-3.3-70b-versatile",
        temperature=0.0,
        api_key=os.environ["GROQ_API_KEY"],
        base_url="https://api.groq.com/openai/v1",
    )
    qa_chain = RetrievalQA.from_chain_type(llm=llama_chat, retriever=retriever)
    # ainvoke is the async counterpart of invoke
    result = await qa_chain.ainvoke({"query": "What is retrieval-augmented generation?"})
    print(f"Answer: {result['result']}")
asyncio.run(main())
Alternative model: Together AI Llama
Use Together AI Llama for a strong instruct-tuned Llama model with good cost-performance balance.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation."}]
)
print(response.choices[0].message.content)
Performance
Latency: ~2-5 seconds per query for llama-3.3-70b with retrieval
Cost: ~$0.03 per 1,000 tokens for llama-3.3-70b via Groq, plus embedding costs
Rate limits: typically 60 RPM and 60,000 TPM on the Groq API; check provider limits
- Limit retrieved documents to top 3-5 to reduce prompt size
- Use concise prompts and system instructions to save tokens
- Cache embeddings and reuse the vector store to avoid recomputing (see the persistence sketch below)
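A minimal persistence sketch, reusing raw_docs and embeddings from the full code (the index directory name is illustrative):
INDEX_DIR = "faiss_index"  # illustrative path
if os.path.exists(INDEX_DIR):
    # Reload the saved index instead of re-embedding the corpus on every run
    vector_store = FAISS.load_local(INDEX_DIR, embeddings, allow_dangerous_deserialization=True)
else:
    vector_store = FAISS.from_documents(raw_docs, embeddings)
    vector_store.save_local(INDEX_DIR)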
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard RAG with llama-3.3-70b | ~3s | ~$0.03 | High-quality, accurate answers |
| Streaming RAG response | ~3s + stream | ~$0.03 | Interactive apps with long answers |
| Async RAG pipeline | ~3s concurrent | ~$0.03 | Web servers and concurrent calls |
| Together AI Llama model | ~2.5s | ~$0.025 | Cost-effective instruct-tuned Llama |
Quick tip
Use a vector store like FAISS with OpenAI embeddings to efficiently retrieve relevant documents before querying the Llama model for accurate RAG results.
Common mistake
Beginners often skip providing an embeddings object when indexing in FAISS (or embed queries with a different model than the documents), causing retrieval to fail or return irrelevant results.
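A minimal illustration, assuming the imports above; FAISS.from_documents embeds the texts with the supplied Embeddings object and reuses that same object to embed each query, so index and queries share one vector space:
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
docs = [Document(page_content="RAG combines retrieval with generation.")]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Index and query with the same embeddings object, never raw unembedded text
vector_store = FAISS.from_documents(docs, embeddings)
print(vector_store.similarity_search("What is retrieval-augmented generation?", k=1))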