Code intermediate · 3 min read

Llama RAG pipeline Python example

Direct answer
Use the OpenAI-compatible SDK with a vector store like FAISS and a Llama model from Groq or Together AI to build a retrieval-augmented generation (RAG) pipeline in Python with client.chat.completions.create calls.

Setup

Install
bash
pip install openai faiss-cpu langchain langchain-openai langchain-community
Env vars
OPENAI_API_KEY, GROQ_API_KEY
Imports
python
import os
from openai import OpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import RetrievalQA
from langchain_core.documents import Document

Examples

Query: 'What is retrieval-augmented generation?'
Answer: 'Retrieval-augmented generation (RAG) combines vector search with LLMs to provide accurate, context-aware answers by retrieving relevant documents and generating responses.'
Query: 'Explain Llama model usage in RAG pipelines.'
Answer: 'Llama models can be used as the generative LLM in RAG pipelines by integrating with vector stores like FAISS to retrieve context and generate precise answers.'
Query: 'How to handle empty search results in RAG?'
Answer: 'Implement fallback logic to handle empty retrievals, such as default responses or querying the LLM without context.'
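The fallback behavior in the last example can be sketched as a small prompt-building helper. The function name and prompt wording below are illustrative, not from any library:

```python
def build_prompt(retrieved_texts, query):
    """Build the LLM prompt, falling back when retrieval returns nothing."""
    if not retrieved_texts:
        # Empty retrieval: query the LLM without context and ask it to flag uncertainty
        return (f"Question: {query}\n"
                "No supporting documents were found; answer from general knowledge and say so.")
    context = "\n\n".join(retrieved_texts)
    return f"{context}\n\nQuestion: {query}"

# With and without retrieved context:
print(build_prompt(["RAG combines retrieval with generation."], "What is RAG?"))
print(build_prompt([], "What is RAG?"))
```

The same branch can instead return a canned default answer without calling the LLM at all, which avoids spending tokens when no context exists.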

Integration steps

  1. Initialize the OpenAI-compatible client with the Llama model and API key from os.environ
  2. Load and embed documents into a FAISS vector store for retrieval
  3. Create a retrieval-based QA chain combining the vector store retriever and Llama chat model
  4. Invoke the chain with a user query to retrieve relevant documents and generate an answer
  5. Print the generated answer to the console

Full code

python
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.chains import RetrievalQA

# Load documents from a folder of text files
loader = DirectoryLoader("./docs", glob="*.txt", loader_cls=TextLoader)
raw_docs = loader.load()

# Embeddings object; LangChain calls the OpenAI embeddings API as needed
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.environ["OPENAI_API_KEY"])

# Build FAISS vector store (embeds the documents internally)
vector_store = FAISS.from_documents(raw_docs, embeddings)

# Setup retriever
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Llama chat model via the Groq OpenAI-compatible endpoint
llama_chat = ChatOpenAI(
    model="llama-3.3-70b-versatile",
    temperature=0.0,
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

# Create RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(llm=llama_chat, retriever=retriever)

# Query example
query = "What is retrieval-augmented generation?"
result = qa_chain.invoke({"query": query})

print(f"Query: {query}")
print(f"Answer: {result['result']}")

API trace

Request
json
{"model": "llama-3.3-70b-versatile", "messages": [{"role": "user", "content": "<retrieved context>\nQuestion: What is retrieval-augmented generation?"}]}
Response
json
{"choices": [{"message": {"content": "Retrieval-augmented generation (RAG) combines vector search with large language models to provide accurate, context-aware answers by retrieving relevant documents and generating responses."}}]}
Extract: response.choices[0].message.content
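For a logged trace like the JSON above, the same field path applies to the parsed payload (the response string here is copied from the trace for illustration):

```python
import json

raw = '{"choices": [{"message": {"content": "RAG combines vector search with LLMs."}}]}'
payload = json.loads(raw)

# Same extraction path as the SDK object: choices[0].message.content
answer = payload["choices"][0]["message"]["content"]
print(answer)  # → RAG combines vector search with LLMs.
```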

Variants

Streaming response version

Use streaming for better user experience with long answers or interactive applications.

python
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation."}],
    stream=True
)

for chunk in response:
    delta = chunk.choices[0].delta.content  # streaming chunks expose delta, not message
    if delta:
        print(delta, end="")
print()
Async version with LangChain

Use async when integrating into asynchronous web servers or concurrent workflows.

python
import os
import asyncio
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.chains import RetrievalQA

async def main():
    loader = DirectoryLoader("./docs", glob="*.txt", loader_cls=TextLoader)
    raw_docs = loader.load()
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.environ["OPENAI_API_KEY"])
    vector_store = FAISS.from_documents(raw_docs, embeddings)
    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
    llama_chat = ChatOpenAI(
        model="llama-3.3-70b-versatile",
        temperature=0.0,
        api_key=os.environ["GROQ_API_KEY"],
        base_url="https://api.groq.com/openai/v1",
    )
    qa_chain = RetrievalQA.from_chain_type(llm=llama_chat, retriever=retriever)
    result = await qa_chain.ainvoke({"query": "What is retrieval-augmented generation?"})
    print(f"Answer: {result['result']}")

asyncio.run(main())
Alternative model: Together AI Llama

Use Together AI Llama for a strong instruct-tuned Llama model with good cost-performance balance.

python
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation."}]
)
print(response.choices[0].message.content)

Performance

Latency: ~2-5 seconds per query for llama-3.3-70b with retrieval
Cost: ~$0.03 per 1000 tokens for llama-3.3-70b via Groq, plus embedding costs
Rate limits: Typically 60 RPM and 60,000 TPM on Groq API; check provider limits
  • Limit retrieved documents to top 3-5 to reduce prompt size
  • Use concise prompts and system instructions to save tokens
  • Cache embeddings and reuse vector store to avoid recomputing
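The caching tip above can be sketched with a content-hash cache so each document is embedded at most once across runs. The file name, helper, and `fake_embed` stand-in are all hypothetical; swap `fake_embed` for the real embeddings call:

```python
import hashlib
import os
import pickle
import tempfile

def cached_embeddings(texts, embed_fn, path):
    """Embed each text at most once, keyed by a hash of its content."""
    cache = {}
    if os.path.exists(path):
        with open(path, "rb") as f:
            cache = pickle.load(f)
    vectors, dirty = [], False
    for text in texts:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(text)  # only computed on a cache miss
            dirty = True
        vectors.append(cache[key])
    if dirty:
        with open(path, "wb") as f:
            pickle.dump(cache, f)
    return vectors

# Stand-in embedder for illustration; replace with the real embeddings API call
tmp = os.path.join(tempfile.mkdtemp(), "emb_cache.pkl")
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text))]

cached_embeddings(["doc one", "doc two"], fake_embed, path=tmp)
cached_embeddings(["doc one", "doc two"], fake_embed, path=tmp)  # second call hits the cache
```

For the vector store itself, LangChain's FAISS wrapper also supports `save_local`/`load_local`, which avoids rebuilding the index on every run.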
Approach | Latency | Cost/call | Best for
Standard RAG with llama-3.3-70b | ~3s | ~$0.03 | High-quality, accurate answers
Streaming RAG response | ~3s + stream | ~$0.03 | Interactive apps with long answers
Async RAG pipeline | ~3s concurrent | ~$0.03 | Web servers and concurrent calls
Together AI Llama model | ~2.5s | ~$0.025 | Cost-effective instruct-tuned Llama

Quick tip

Use a vector store like FAISS with OpenAI embeddings to efficiently retrieve relevant documents before querying the Llama model for accurate RAG results.

Common mistake

Beginners often forget to embed documents before indexing in FAISS, causing retrieval to fail or return irrelevant results.
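A quick sanity check before indexing catches this mistake early; the helper below is illustrative, verifying there is one fixed-dimension vector per document:

```python
def check_embeddings(docs, vectors):
    """Verify there is one fixed-dimension vector per document before indexing."""
    if len(docs) != len(vectors):
        raise ValueError(f"{len(docs)} docs but {len(vectors)} vectors")
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return len(docs)

# Passes: two docs, two 3-dimensional vectors
check_embeddings(["a", "b"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])  # → 2
```

Note that `FAISS.from_documents` sidesteps the issue entirely by taking an Embeddings object and computing the vectors itself.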

Verified 2026-04 · llama-3.3-70b-versatile, meta-llama/Llama-3.3-70B-Instruct-Turbo, text-embedding-3-small