How to build RAG with Vertex AI
Quick answer
Build Retrieval-Augmented Generation (RAG) with
Vertex AI by first embedding your documents (this example uses OpenAIEmbeddings; Vertex AI embeddings are an alternative), indexing them in a vector store such as FAISS, then querying the vector store to retrieve relevant context and passing it to a Vertex AI Gemini chat model for generation. Use the vertexai Python SDK to integrate the embedding, retrieval, and chat completion steps.

Prerequisites

- Python 3.8+
- Google Cloud project with Vertex AI enabled
- Service account with Vertex AI permissions
- pip install vertexai langchain langchain_community faiss-cpu openai
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable for authentication
Setup
Install the required Python packages and set up authentication for Google Vertex AI. You need to enable Vertex AI API in your Google Cloud project and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to your service account JSON key file.
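Before calling into the SDK, it helps to fail fast if the environment is incomplete. The sketch below (a hypothetical helper, not part of the vertexai SDK) checks that the two environment variables this guide relies on are set:

```python
import os

def check_vertex_env(required=("GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS")):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

missing = check_vertex_env()
if missing:
    print("Set these before calling vertexai.init():", ", ".join(missing))
```

Running this at startup gives a clear error message instead of an opaque authentication failure deep inside the SDK.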
pip install vertexai langchain langchain_community faiss-cpu openai

Step by step
This example shows how to embed documents, create a FAISS vector store, query it with a user question, and generate an answer using Vertex AI's Gemini chat model.
```python
import os

import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from langchain_community.embeddings import OpenAIEmbeddings  # requires OPENAI_API_KEY
from langchain_community.vectorstores import FAISS
from langchain.schema import Document

# Initialize Vertex AI with your project and location
vertexai.init(project=os.environ['GOOGLE_CLOUD_PROJECT'], location='us-central1')

# Sample documents to index
texts = [
    "Vertex AI is Google Cloud's managed ML platform.",
    "RAG combines retrieval with generation for better answers.",
    "FAISS is a popular vector search library.",
    "Gemini models provide powerful chat capabilities."
]

# Create LangChain Document objects
docs = [Document(page_content=text) for text in texts]

# Create embeddings with OpenAI (swap in Vertex AI embeddings if preferred)
embeddings = OpenAIEmbeddings()

# Build FAISS vector store from documents
vector_store = FAISS.from_documents(docs, embeddings)

# User query
query = "What is Vertex AI?"

# Retrieve top 2 relevant documents
retrieved_docs = vector_store.similarity_search(query, k=2)

# Combine retrieved context
context = "\n".join(doc.page_content for doc in retrieved_docs)

# Prepare prompt with context
prompt = f"Use the following context to answer the question:\n{context}\nQuestion: {query}\nAnswer:"

# Load Vertex AI Gemini generative model
model = GenerativeModel('gemini-2.0-flash')

# Generate answer; max_output_tokens is passed via generation_config,
# not as a direct keyword argument
response = model.generate_content(
    prompt,
    generation_config=GenerationConfig(max_output_tokens=256),
)
print("Answer:", response.text.strip())
```

Output
Answer: Vertex AI is Google Cloud's managed ML platform that enables building, deploying, and scaling machine learning models.
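Under the hood, similarity_search embeds the query and returns the stored documents whose vectors are closest to it. A minimal pure-Python sketch of that idea, using toy 3-dimensional vectors and cosine similarity (no FAISS or real embeddings involved):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" for three documents and a query
doc_vectors = {
    "Vertex AI is Google Cloud's managed ML platform.": [0.9, 0.1, 0.0],
    "FAISS is a popular vector search library.": [0.1, 0.9, 0.0],
    "Gemini models provide powerful chat capabilities.": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "What is Vertex AI?"

# Rank documents by cosine similarity to the query, keep top k=2
ranked = sorted(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]), reverse=True)
top_k = ranked[:2]
print(top_k[0])  # the Vertex AI document ranks first
```

Real vector stores like FAISS do the same ranking over high-dimensional embeddings with optimized index structures, but the retrieval contract is identical: query vector in, top-k nearest documents out.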
Common variations
- Use GenerativeModel.start_chat() for chat-based RAG with conversational context.
- Replace OpenAIEmbeddings with Vertex AI embeddings for tighter integration.
- Use asynchronous calls with asyncio and await for scalable applications.
- Switch vector stores to Chroma or cloud-hosted vector DBs for large-scale retrieval.
```python
import asyncio
import os

import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema import Document

async def rag_async():
    vertexai.init(project=os.environ['GOOGLE_CLOUD_PROJECT'], location='us-central1')
    texts = ["Vertex AI is a managed ML platform.", "RAG improves generation with retrieval."]
    docs = [Document(page_content=t) for t in texts]
    embeddings = OpenAIEmbeddings()
    vector_store = FAISS.from_documents(docs, embeddings)
    query = "Explain RAG"
    retrieved_docs = vector_store.similarity_search(query, k=1)
    context = "\n".join(doc.page_content for doc in retrieved_docs)
    model = GenerativeModel('gemini-2.0-flash')
    prompt = f"Context:\n{context}\nQuestion: {query}\nAnswer:"
    # Async generation uses generate_content_async, not generate_content
    response = await model.generate_content_async(
        prompt,
        generation_config=GenerationConfig(max_output_tokens=256),
    )
    print("Async answer:", response.text.strip())

asyncio.run(rag_async())
```

Output
Async answer: RAG stands for Retrieval-Augmented Generation, a technique that improves language model responses by retrieving relevant documents to provide context.
Troubleshooting
- If you get authentication errors, verify GOOGLE_APPLICATION_CREDENTIALS points to a valid service account JSON key with Vertex AI permissions.
- If embeddings or vector search return no results, check that the documents were indexed and that the same embedding model is used for indexing and querying.
- For quota or API errors, ensure your Google Cloud project has Vertex AI enabled and billing is active.
- Model names like gemini-2.0-flash may change over time; check the Vertex AI model list for current names.
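Quota and rate-limit errors are often transient, so retrying with exponential backoff usually resolves them. A generic stdlib-only sketch (the helper name and the choice of which exceptions count as retryable are assumptions; the Vertex AI SDK's specific quota exception types are not verified here):

```python
import time

def with_retries(fn, attempts=4, base_delay=1.0, retryable=(Exception,)):
    """Call fn(), retrying with exponential backoff on retryable exceptions.
    In practice you would pass the SDK's quota/rate-limit exception
    classes as `retryable` rather than catching everything."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch:
# answer = with_retries(lambda: model.generate_content(prompt))
```

Doubling the delay on each attempt spaces requests out so a briefly exhausted quota has time to recover instead of being hammered in a tight loop.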
Key Takeaways
- Use Vertex AI's Gemini models combined with vector search for effective RAG implementations.
- Leverage LangChain and FAISS for easy document embedding and retrieval integration.
- Always authenticate with a service account and set environment variables correctly for Vertex AI.
- Consider async and chat-based models for scalable and conversational RAG applications.